I am trying to split the following vector:
x <- c(1, 19, 25, 62, 38, 41, 52, 53, 60, 61, 1, 74, 72, 66, 1, 68, 5, 1)
What I would like to do is split the above using the number 1 as the break points.
x1 <- c(1, 19, 25, 62, 38, 41, 52, 53, 60, 61)
x2 <- c(1, 74, 72, 66)
x3 <- c(1, 68, 5)
There must be a simple method to use but I am drawing a blank and my search-fu is weak and coming up empty.
Thanks for your help.
Use split with cumsum:
x <- c(1, 19, 25, 62, 38, 41, 52, 53, 60, 61, 1, 74, 72, 66, 1, 68, 5, 1)
split(x, f=cumsum(x==1))
#> $`1`
#> [1] 1 19 25 62 38 41 52 53 60 61
#>
#> $`2`
#> [1] 1 74 72 66
#>
#> $`3`
#> [1] 1 68 5
#>
#> $`4`
#> [1] 1
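To see why this works, it helps to look at the grouping vector itself: x == 1 is TRUE at every break point, so the running sum increases by one each time a new piece starts. The sketch below also drops the trailing piece that holds only the break value and renames the pieces x1, x2, x3; that last step is an assumption about what the asker wants, not part of the answer above.

```r
x <- c(1, 19, 25, 62, 38, 41, 52, 53, 60, 61, 1, 74, 72, 66, 1, 68, 5, 1)

# TRUE at every break point; the running sum becomes a group id
grp <- cumsum(x == 1)
grp

parts <- split(x, grp)

# Assumption: a piece that contains only the break value itself is unwanted
parts <- parts[lengths(parts) > 1]

# Name the remaining pieces x1, x2, ... to match the question
names(parts) <- paste0("x", seq_along(parts))
parts
```

Keeping the pieces in a named list, rather than creating separate variables x1, x2, x3, makes it easy to loop over them later with lapply().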
I have a data frame that has rows that represent communities. For columns, the first column is the group that the community falls into (a total of 6 groups) and the remaining 8 are IDs of each member of the community.
What I would like to do is have a community (row) within groups 1, 3, and 5 to be picked where there is no overlap between them. Then, once I have that - I would like to pick a community from groups 2, 4, and 6 where there is no more than 25% overlap between the selected 6 total communities.
Here is an example dataset:
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
Based on the criteria I mentioned above, the following could be pulled out:
Group 1: 125, 8, 40, 127, 19, 33, 29, 3
Group 3: 11, 25, 126, 22, 56, 4, 6, 52
Group 5: 5, 63, 18, 48, 37, 32, 43, 1
Group 2: 25, 37, 8, 38, 40, 124, 32, 56
Group 4: 125, 15, 29, 4, 48, 5, 128, 11
Group 6: 34, 23, 33, 32, 63, 22, 19, 56
I believe this might be helpful (please let me know if not!).
The first step is to subset your data to Groups 1, 3, and 5. Then, using transpose() from purrr and splitting by Group, cross() gives all combinations that take one row from each group.
library(purrr)
library(dplyr)  # for bind_rows()
df <- as.data.frame(df)  # cbind() returns a matrix, and $ does not work on a matrix
grp_135 <- df[df$Group %in% c(1, 3, 5), ]
all_combn_135 <- lapply(cross(split(transpose(grp_135), grp_135$Group)), bind_rows)
Checking the first element to see what we have:
R> all_combn_135[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 29 4 3 63 125 40 32 38
3 5 5 63 18 48 37 32 43 1
Next, we can check for overlap by counting duplicates. Here I unlist the three rows, use table() to get frequencies, and sum them up, subtracting 1 from each count so that only the extra copies are counted.
combn_ovlp_135 <- lapply(all_combn_135, function(x) {
  sum(table(unlist(x[-1])) - 1)
})
The ones without overlap can be obtained by:
no_ovlp <- all_combn_135[combn_ovlp_135 == 0]
no_ovlp
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 11 25 126 22 56 4 6 52
3 5 5 63 18 48 37 32 43 1
For the next part, do something similar (this could be factored out into a general function), except that when checking for overlap, each combination is pooled with the first element of no_ovlp from the previous step:
grp_246 <- df[df$Group %in% c(2, 4, 6), ]
all_combn_246 <- lapply(cross(split(transpose(grp_246), grp_246$Group)), bind_rows)
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
  sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) - 1) / ((ncol(df) - 1) * 6)
})
It is not entirely clear how you want to calculate overlap for this part and compare it with 25%. I counted duplicates and then divided by the total number of IDs: the number of columns (8, not counting Group) multiplied by 6 (rows). To see which combinations of Groups 2, 4, and 6 could be combined with no_ovlp, you can try the following:
all_combn_246[combn_ovlp_246 < .25]
In my case, I believe none of the combinations met this criterion, although the first with 37.5% overlap was the minimum:
R> all_combn_246[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 25 37 8 38 40 124 32 56
2 4 125 15 29 4 48 5 128 11
3 6 34 23 33 32 63 22 19 56
What was unclear is how to count duplicates. For example, how much overlap is there in x <- c(1, 2, 3, 3, 3)?
This could be two duplicates (two extra 3's):
R> sum(table(x) - 1)
[1] 2
Or you could count number of values that have any duplicates (just the number 3 is duplicated):
R> sum(table(x) > 1)
[1] 1
If it is the latter, you could try:
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
  sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) > 1) / ((ncol(df) - 1) * 6)
})
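The two ways of counting can be put side by side as small helper functions. This is only a sketch: the function names count_extra_copies and count_repeated_values are made up for illustration and do not come from any package.

```r
x <- c(1, 2, 3, 3, 3)

# Measure 1: extra copies beyond the first occurrence of each value
count_extra_copies <- function(v) sum(table(v) - 1)

# Measure 2: number of distinct values that occur more than once
count_repeated_values <- function(v) sum(table(v) > 1)

count_extra_copies(x)     # the two extra 3's
count_repeated_values(x)  # only the value 3 is repeated

# Measure 1 can also be computed without building a table
length(x) - length(unique(x))
```

Whichever measure is chosen, dividing by the total number of IDs (8 columns times 6 rows here) turns it into the fraction that is compared against 0.25.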
By shamelessly stealing Ben's use of cross(), I have this approach that I personally find easier to read:
library(dplyr)  # %>%, as_tibble(), filter(), group_by(), mutate()
library(tidyr)  # nest()
library(purrr)  # map(), map_int(), cross()
# Returns the number of overlapping elements
overlap <- function(xx) {
  length(unlist(xx)) - length(unique(unlist(xx)))
}
df_135 <- df %>%
  as_tibble() %>%
  filter(Group %in% c(1, 3, 5)) %>%
  group_by(Group) %>%
  mutate(Community = paste0("g", Group, "_", row_number())) %>%
  nest(Members = starts_with("Isol_")) %>%
  mutate(Members = map(Members, as.integer))
df_135
# A tibble: 12 x 3
# Groups: Group [3]
# Group Community Members
# <dbl> <chr> <list>
# 1 1 g1_1 <int [8]>
# 2 1 g1_2 <int [8]>
# 3 1 g1_3 <int [8]>
# 4 1 g1_4 <int [8]>
# 5 3 g3_1 <int [8]>
# 6 3 g3_2 <int [8]>
# 7 3 g3_3 <int [8]>
# 8 3 g3_4 <int [8]>
# 9 5 g5_1 <int [8]>
#10 5 g5_2 <int [8]>
#11 5 g5_3 <int [8]>
#12 5 g5_4 <int [8]>
# Compute all combinations across groups
all_combns <- cross(split(df_135$Members, df_135$Group))
# select the combinations with the desired overlap
all_combns[map_int(all_combns, overlap) == 0]
# [[1]]
# [[1]]$`1`
# [1] 125 8 40 127 19 33 29 3
#
# [[1]]$`3`
# [1] 11 25 126 22 56 4 6 52
#
# [[1]]$`5`
# [1] 5 63 18 48 37 32 43 1
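Note that cross() was deprecated in purrr 1.0.0, so it may emit a deprecation warning on current installations. As a hedged alternative, the same search can be sketched in base R with expand.grid() over row indices. The names members, rows_by_group, combos and n_ovlp are illustrative only, and the data frame is rebuilt here so the block runs on its own.

```r
df <- data.frame(
  Group  = rep(1:6, each = 4),  # same grouping as in the question
  Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1),
  Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4),
  Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11),
  Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5),
  Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52),
  Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23),
  Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18),
  Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
)

g135 <- df[df$Group %in% c(1, 3, 5), ]

# One vector of member IDs per community (row), without the Group column
members <- lapply(seq_len(nrow(g135)), function(i) {
  unlist(g135[i, -1], use.names = FALSE)
})

# Row positions within g135, split by group
rows_by_group <- split(seq_len(nrow(g135)), g135$Group)

# Every way of picking one row from each group (4 * 4 * 4 = 64 combinations)
combos <- expand.grid(rows_by_group)

# Number of duplicated IDs when the chosen rows are pooled
overlap <- function(idx) {
  v <- unlist(members[idx])
  length(v) - length(unique(v))
}
n_ovlp <- apply(combos, 1, overlap)

# Combinations with no shared member at all
combos[n_ovlp == 0, , drop = FALSE]
```

Each row of the result holds row positions within g135, so g135[unlist(combos[k, ]), ] recovers the communities for combination k.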
Here's a plain R solution. It's not the most efficient one, but it's very straightforward and therefore easy to follow.
The code below collects all the values in subset 1 (Groups 1, 3, 5) and subset 2 (Groups 2, 4, 6), and samples n isolates from each list. It then tests the overlap against the threshold and resamples subset 2 if necessary. For your request it only needs to resample once or twice, but if your threshold is lower (e.g. 0.05), it may resample up to 50 times before it gets it right. If your threshold is too low and your number of samples too large (i.e. it is impossible to draw such a sample), it will tell you that it failed.
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
df = as.data.frame(df)
subset1 <- df[df$Group %in% c(1,3,5),]
subset2 <- df[df$Group %in% c(2,4,6),]
values_in_subset1 <- subset1[2:ncol(subset1)] # Drop group column
values_in_subset1 <- as.vector(t(values_in_subset1)) # Convert to single vector
values_in_subset2 <- subset2[2:ncol(subset2)] # Drop group column
values_in_subset2 <- as.vector(t(values_in_subset2)) # Convert to single vector
n_sampled <- 8
sample1 <- sample(values_in_subset1, n_sampled, replace=F) #Replace=F is default, added here for readability
sample2 <- sample(values_in_subset2, n_sampled, replace=F) #Replace=F is default, added here for readability
percentage_overlap <- sum(sample1 %in% sample2)/n_sampled
min_percentage_overlap <- 0.25
retries <- 1
# Retry until it gets it right
while (percentage_overlap > min_percentage_overlap && retries < 1000) {
  retries <- retries + 1
  sample2 <- sample(values_in_subset2, n_sampled, replace=F) #Replace=F is default, added here for readability
  percentage_overlap <- sum(sample1 %in% sample2)/n_sampled
}
# Report on number of attempts
cat(paste("Sampled", retries, "times to make sure there was less than", min_percentage_overlap*100,"% overlap."))
# Finally, check if it worked.
if (percentage_overlap <= min_percentage_overlap) {
  cat("It's super effective! (not really though)")
} else {
  cat("But it failed!")
}
I want to create a column A4 from A3, reducing the value of A3 by 1 where Total == 63. What am I doing wrong here?
tb1 %>%
mutate(A4 = replace(A3, Total == 63, A3-1))
The complete code with data is here
library(tidyverse)
tb1 <-
structure(
list(
A1 = c(16, 11, 16, 18, 20, 19, 16, 18, 20, 15,
17, 19, 19, 19, 16, 19, 16, 15, 19, 19, 16, 18, 18, 19, 19, 18,
20, 18, 19, 19, 19, 19, 17, 19, 17, 16, 18, 19, 16, 18, 17, 19,
19, 20, 17, 16, 18, 16, 15, 19, 19, 17, 20, 18, 16, 19, 19, 15,
17, 17, 19, 19, 16, 17, 18, 19, 17, 19, 17, 15, 19, 16, 17
)
, A2 = c(8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
)
, A3 = c(33, 34, 38, 36, 36, 34, 41, 36, 40, 38, 38, 41, 38, 34, 33, 36,
41, 40, 41, 38, 41, 33, 40, 38, 40, 38, 41, 41, 40, 41, 40,
38, 34, 40, 36, 41, 40, 40, 33, 38, 36, 41, 40, 40, 28, 41,
40, 41, 33, 41, 36, 36, 40, 34, 41, 41, 38, 38, 41, 38, 41,
41, 36, 40, 38, 38, 40, 41, 38, 22, 36, 34, 38
)
, Total = c(57, 53, 62, 62, 64, 61, 65, 62, 68, 61, 63, 68, 65, 61, 57, 63,
65, 63, 68, 65, 65, 59, 66, 65, 67, 64, 69, 67, 67, 68, 67,
65, 59, 67, 61, 65, 66, 67, 57, 64, 61, 68, 67, 68, 53, 65,
66, 65, 56, 68, 63, 61, 68, 60, 65, 68, 65, 61, 66, 63, 68,
68, 60, 65, 64, 65, 65, 68, 63, 45, 63, 58, 63
)
)
, class = "data.frame"
, row.names = c(NA, -73L)
)
tb1 %>%
filter(Total == 63)
#> A1 A2 A3 Total
#> 1 17 8 38 63
#> 2 19 8 36 63
#> 3 15 8 40 63
#> 4 19 8 36 63
#> 5 17 8 38 63
#> 6 17 8 38 63
#> 7 19 8 36 63
#> 8 17 8 38 63
tb2 <-
tb1 %>%
mutate(A4 = replace(A3, Total == 63, A3-1)) %>%
mutate(Total = A1 + A2 + A3)
#> Warning: Problem with `mutate()` input `A4`.
#> x number of items to replace is not a multiple of replacement length
#> ℹ Input `A4` is `replace(A3, Total == 63, A3 - 1)`.
tb2 %>%
filter(Total == 62)
#> A1 A2 A3 Total
#> 1 16 8 38 62
#> 2 18 8 36 62
#> 3 18 8 36 62
You are better off using ifelse here:
library(dplyr)
tb1 %>% mutate(A4 = ifelse(Total == 63, A3 -1, A3))
As for why replace does not work, look at the source code of replace:
replace
function (x, list, values)
{
x[list] <- values
x
}
It assigns values to x after subsetting for list.
When you use :
tb1 %>% mutate(A4 = replace(A3, Total == 63, A3-1))
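the values argument (A3-1) has length length(tb1$A3), i.e. one value per row, but list selects only sum(tb1$Total == 63) positions. The lengths do not match, so R takes only as many values as it needs from the front of A3-1 and issues the warning "number of items to replace is not a multiple of replacement length". The selected positions therefore receive the first few elements of A3-1 rather than the values from their own rows.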
If you want to make replace work you can try :
tb1 %>% mutate(A4 = replace(A3, Total == 63, A3[Total == 63] -1))
but again as I mentioned it is easier to just use ifelse here.
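The length issue is easy to see on a toy example outside of mutate(). The sketch below uses four made-up values rather than the full tb1, and the variable names hit, via_replace, via_ifelse and via_arith are only for illustration.

```r
A3    <- c(38, 36, 40, 33)
Total <- c(63, 62, 63, 57)

hit <- Total == 63  # logical vector, same length as A3
sum(hit)            # number of positions that will be overwritten

# replace() wants one replacement value per selected position,
# so the replacement has to be subset with the same condition
via_replace <- replace(A3, hit, A3[hit] - 1)

# ifelse() works element-wise over the full-length vectors
via_ifelse <- ifelse(hit, A3 - 1, A3)

# Subtracting the logical vector (TRUE counts as 1) gives the same numbers
via_arith <- A3 - hit

via_replace
via_ifelse
via_arith
```

All three produce the same vector; ifelse() is usually the most readable inside mutate() because every argument already has one element per row.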
Using piping in R (with %>%), how can one pass specific vector elements from a function's output to feed the next function's arguments?
I've tried using the dot placeholder with a position in square brackets (i.e., .[1], .[2]) to no avail.
The only way that was working for me was with sapply(), but I'm wondering whether there's a simpler solution I'm missing.
Example
#I have a vector containing a sequence of numbers, with some duplicates and gaps,
#and I want to use its start and end points to create an analogous consecutive sequence.
original_sequence <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57,
58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
88, 89, 90, 91, 92, 93, 94, 95, 96, 98, 98, 99, 100, 101,
102, 103, 104, 105, 106, 107, 108, 109, 110)
## unsuccessful attempt #1
original_sequence %>%
range() %>%
seq()
[1] 1 2 ## this result is equivalent to the output of `seq(2)`,
## but what I want is to compute `seq(1 ,110)`.
## unsuccessful attempt #2
original_sequence %>%
range() %>%
seq(.[1]), .[2])
Error: unexpected ',' in:
" range() %>%
seq(.[1]),"
## attempt #3: somewhat successful but I wonder whether there's a better way
original_sequence %>%
range() %>%
sapply(., seq)
[[1]]
[1] 1
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
[39] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
[77] 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
Bottom line -- I was able to do it with sapply but I hope to figure out a solution in the spirit of my second attempt, because it's more handy to know a universal way to cherry-pick the specific vector elements you want to pass to the next function's arguments.
One way would be to use {} and pass input arguments to seq
library(dplyr)
original_sequence %>%
range() %>%
{seq(.[[1]], .[2])}
#[1] 1 2 3 4 5 6 7 8 9 10 11 12......
Or we can mix it with base R do.call
original_sequence %>% range() %>% {do.call(seq, as.list(.))}
Or, as @Ozan147 mentioned, if your sequence always starts with 1 we can use max
original_sequence %>% max %>% seq
We can use reduce
library(tidyverse)
original_sequence %>%
range %>%
reduce(seq)
#[1] 1 2 3 4 ...
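For completeness, here is a sketch of the same idea without any package, assuming R 4.1 or later for the native pipe |> and the \(x) lambda shorthand. The helper splat() is hypothetical (it is not a base R or magrittr function); it just spreads the elements of a vector over the arguments of the next function.

```r
# Same values as the vector in the question, written compactly
original_sequence <- c(1:96, 98, 98:110)

# Option 1: an anonymous function that picks elements by position
by_position <- original_sequence |>
  range() |>
  (\(r) seq(r[1], r[2]))()

# Option 2: a small hypothetical helper that passes every element of
# the vector as a separate argument to f
splat <- function(v, f) do.call(f, as.list(v))

by_splat <- original_sequence |>
  range() |>
  splat(seq)

head(by_position)
length(by_splat)
```

Option 1 is the general way to cherry-pick specific elements, since any index can be used inside the anonymous function. Option 2 only fits when the whole vector maps onto the arguments in order.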
I am trying to create a named vector of GO IDs from a tab-separated file, with the gene IDs as names, but it did not work.
> head(read.delim("~/GOmapping.tsv", sep = '\t'))
V1 V14
1 sp0000005 GO:0003723
2 sp0000006 GO:0016021
3 sp0000007 GO:0003700,GO:0006355,GO:0043565
4 sp0000016 GO:0046983
5 sp0000017 GO:0004672,GO:0005524,GO:0006468
6 sp0000022 GO:0003677,GO:0046983
> head(read.delim("~/GOmapping.tsv", sep = '\t'))[1]
V1
1 sp0000005
2 sp0000006
3 sp0000007
4 sp0000016
5 sp0000017
6 sp0000022
> head(read.delim("~/GOmapping.tsv", sep = '\t'))[2]
V14
1 GO:0003723
2 GO:0016021
3 GO:0003700,GO:0006355,GO:0043565
4 GO:0046983
5 GO:0004672,GO:0005524,GO:0006468
6 GO:0003677,GO:0046983
> geneID2GO <- read.delim("~/GOmapping.tsv", sep = '\t'))[2]
> geneID2GO <- read.delim("~/GOmapping.tsv", sep = '\t')[2]
> names(geneID2GO) <- read.delim("~/GOmapping.tsv", sep = '\t')[1]
> head(geneID2GO)
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 57, 58, 59, 60, 6 ...
1 GO:0003723
2 GO:0016021
3 GO:0003700,GO:0006355,GO:0043565
4 GO:0046983
5 GO:0004672,GO:0005524,GO:0006468
6 GO:0003677,GO:0046983
What did I miss?
Thank you in advance.
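Indexing a data frame with [2] or [1] returns a one-column data frame, not a vector, so names() was setting the column name rather than element names. If you want a named vector as the result, extract the columns with [, 2] and [, 1] and coerce the values and the names (column 1) to character.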
data <- read.delim("~/GOmapping.tsv", sep = '\t')
geneID2GO <- as.character(data[,2])
names(geneID2GO) <- as.character(data[,1])
head(geneID2GO)
sp0000005 sp0000006 sp0000007
"GO:0003723" "GO:0016021" "GO:0003700,GO:0006355,GO:0043565"
sp0000016
"GO:0046983"
Alternatively, you can display the result as follows:
cbind(geneID2GO)
geneID2GO
sp0000005 "GO:0003723"
sp0000006 "GO:0016021"
sp0000007 "GO:0003700,GO:0006355,GO:0043565"
sp0000016 "GO:0046983"
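If a downstream tool expects one vector of GO IDs per gene rather than a single comma-separated string, the values can be split with strsplit(). This is only a sketch: the data frame below is a toy stand-in for the first rows of GOmapping.tsv, and whether a list is the desired shape is an assumption.

```r
# Toy stand-in for the first rows of the file shown in the question
data <- data.frame(
  V1  = c("sp0000005", "sp0000006", "sp0000007"),
  V14 = c("GO:0003723", "GO:0016021", "GO:0003700,GO:0006355,GO:0043565"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of GO IDs
geneID2GO <- strsplit(data$V14, ",", fixed = TRUE)

# Attach the gene IDs as names of the list elements
names(geneID2GO) <- data$V1

geneID2GO[["sp0000007"]]
lengths(geneID2GO)
```

Setting stringsAsFactors = FALSE (or calling as.character() as above) matters on R versions before 4.0, where read.delim() turns text columns into factors by default.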