Selecting Combinations Across Columns Row by Row With Overlap Threshold


I have a data frame in which each row represents a community. The first column is the group that the community falls into (6 groups in total), and the remaining 8 columns are the IDs of the community's members.
I would like to pick one community (row) from each of groups 1, 3, and 5 such that the three share no members. Once I have those, I would like to pick one community from each of groups 2, 4, and 6 such that there is no more than 25% overlap among the 6 selected communities.
Here is an example dataset:
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
Based on the criteria I mentioned above, the following could be pulled out:
Group 1: 125, 8, 40, 127, 19, 33, 29, 3
Group 3: 11, 25, 126, 22, 56, 4, 6, 52
Group 5: 5, 63, 18, 48, 37, 32, 43, 1
Group 2: 25, 37, 8, 38, 40, 124, 32, 56
Group 4: 125, 15, 29, 4, 48, 5, 128, 11
Group 6: 34, 23, 33, 32, 63, 22, 19, 56

I believe this might be helpful (please let me know if not!).
The first step is to subset your data to groups 1, 3, and 5. Then transpose() from purrr turns the rows into a list, split() divides that list by Group, and cross() builds every combination that takes one row from each group.
library(purrr)
library(dplyr) # for bind_rows()
df <- as.data.frame(df) # cbind() returns a matrix, so df$Group would fail without this
grp_135 <- df[df$Group %in% c(1, 3, 5), ]
all_combn_135 <- lapply(cross(split(transpose(grp_135), grp_135$Group)), bind_rows)
Checking the first element to see what we have:
R> all_combn_135[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 29 4 3 63 125 40 32 38
3 5 5 63 18 48 37 32 43 1
Next, we can check for overlap by counting duplicates. Here I unlist the three rows (dropping the Group column), tabulate the values with table, and sum the counts after subtracting 1 from each, so that only the extra copies of a value are counted.
combn_ovlp_135 <- lapply(all_combn_135, function(x) {
  sum(table(unlist(x[-1])) - 1)
})
The ones without overlap can be obtained by:
no_ovlp <- all_combn_135[combn_ovlp_135 == 0]
no_ovlp
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 11 25 126 22 56 4 6 52
3 5 5 63 18 48 37 32 43 1
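For readers who would rather not depend on cross() (which, as far as I know, is deprecated in recent purrr releases), here is a base-R sketch of the same enumeration. expand.grid() plays the role of cross(). The helper n_dups() and the column names g1, g3 and g5 are made up for this illustration and are not part of the answer above.

```r
Group  <- rep(1:6, each = 4)
Isol_1 <- c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 <- c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 <- c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 <- c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 <- c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 <- c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 <- c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 <- c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df <- data.frame(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)

# 24 x 8 matrix of member IDs (Group column dropped)
members <- as.matrix(df[, -1])

# Row numbers of the communities in groups 1, 3 and 5
rows_by_group <- split(seq_len(nrow(df)), df$Group)[c("1", "3", "5")]
names(rows_by_group) <- c("g1", "g3", "g5")  # hypothetical, friendlier column names

# One row per combination: 4 * 4 * 4 = 64 candidate triples of row numbers
combos <- expand.grid(rows_by_group)

# Number of extra copies of any ID once the chosen rows are pooled
n_dups <- function(rows) {
  ids <- as.vector(members[rows, ])
  length(ids) - length(unique(ids))
}

overlap <- apply(combos, 1, n_dups)

# Combinations whose communities share no member at all
combos[overlap == 0, ]
```

The result holds row numbers of df rather than the rows themselves; the triple made of rows 1, 11 and 17 corresponds to the combination printed above.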
For the next part, do something similar (this could be broken out into a generalized function), except that when checking for overlap you pool each candidate combination with the first element of no_ovlp from the previous step:
grp_246 <- df[df$Group %in% c(2, 4, 6), ]
all_combn_246 <- lapply(cross(split(transpose(grp_246), grp_246$Group)), bind_rows)
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
  sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) - 1) / ((ncol(df) - 1) * 6)
})
It is not entirely clear how you want to calculate overlap for this part and compare it with 25%. I counted the duplicates and divided by the total number of values: 8 columns (not counting Group) times 6 rows, i.e. 48. To see which combinations of groups 2, 4, and 6 could be combined with no_ovlp, you can try the following:
all_combn_246[combn_ovlp_246 < .25]
In my case, I believe none of the combinations met this criterion, although the first with 37.5% overlap was the minimum:
R> all_combn_246[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 25 37 8 38 40 124 32 56
2 4 125 15 29 4 48 5 128 11
3 6 34 23 33 32 63 22 19 56
What was unclear is how to count duplicates. For example, how much overlap is there in x <- c(1, 2, 3, 3, 3)?
This could be two duplicates (two extra 3's):
R> sum(table(x) - 1)
[1] 2
Or you could count number of values that have any duplicates (just the number 3 is duplicated):
R> sum(table(x) > 1)
[1] 1
If it is the latter, you could try:
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
  sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) > 1) / ((ncol(df) - 1) * 6)
})
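To make the difference between the two counting rules concrete, here is a small sketch that applies both to the six communities printed above. The IDs are copied from those printed rows; the variable names (selected, candidate, pooled and so on) are only illustrative.

```r
# Communities already chosen from groups 1, 3 and 5
selected <- c(125, 8, 40, 127, 19, 33, 29, 3,     # group 1
              11, 25, 126, 22, 56, 4, 6, 52,      # group 3
              5, 63, 18, 48, 37, 32, 43, 1)       # group 5

# First candidate combination from groups 2, 4 and 6
candidate <- c(25, 37, 8, 38, 40, 124, 32, 56,    # group 2
               125, 15, 29, 4, 48, 5, 128, 11,    # group 4
               34, 23, 33, 32, 63, 22, 19, 56)    # group 6

pooled <- c(selected, candidate)                  # 48 IDs in total

# Rule 1: every extra copy of an ID counts once
extra_copies <- sum(table(pooled) - 1)

# Rule 2: every ID that occurs more than once counts once
repeated_ids <- sum(table(pooled) > 1)

extra_copies / length(pooled)   # 18 / 48 = 0.375
repeated_ids / length(pooled)   # 16 / 48, about 0.333
```

Under either rule this candidate stays above the 25% threshold, so which rule is chosen matters mainly for combinations close to the cut-off.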

By shamelessly stealing Ben's use of cross(), I have this approach that I personally find easier to read:
library(tidyverse)
# Returns the number of overlapping elements
overlap <- function(xx){
  length(unlist(xx)) - length(unique(unlist(xx)))
}
df_135 <- df %>%
  as_tibble() %>%
  filter(Group %in% c(1,3,5)) %>%
  group_by(Group) %>%
  # Label each community within its group, e.g. "g1_1"
  mutate(Community = paste0("g", Group, "_", row_number())) %>%
  nest(Members = starts_with("Isol_")) %>%
  mutate(Members = map(Members, as.integer))
df_135
# A tibble: 12 x 3
# Groups: Group [3]
# Group Community Members
# <dbl> <chr> <list>
# 1 1 g1_1 <int [8]>
# 2 1 g1_2 <int [8]>
# 3 1 g1_3 <int [8]>
# 4 1 g1_4 <int [8]>
# 5 3 g3_1 <int [8]>
# 6 3 g3_2 <int [8]>
# 7 3 g3_3 <int [8]>
# 8 3 g3_4 <int [8]>
# 9 5 g5_1 <int [8]>
#10 5 g5_2 <int [8]>
#11 5 g5_3 <int [8]>
#12 5 g5_4 <int [8]>
# Compute all combinations across groups
all_combns <- cross(split(df_135$Members, df_135$Group))
# select the combinations with the desired overlap
all_combns[map_int(all_combns, overlap) == 0]
# [[1]]
# [[1]]$`1`
# [1] 125 8 40 127 19 33 29 3
#
# [[1]]$`3`
# [1] 11 25 126 22 56 4 6 52
#
# [[1]]$`5`
# [1] 5 63 18 48 37 32 43 1

Here's a plain R solution. It's not the most efficient one, but it's very straightforward and therefore easy to follow.
The code below collects all the values in subset 1 (groups 1, 3, 5) and subset 2 (groups 2, 4, 6), and samples n isolates from each list. It then checks the overlap between the two samples and resamples subset 2 if necessary. For your request it only needs to resample once or twice, but with a lower threshold (e.g. 0.05) it may resample up to 50 times before it succeeds. If the threshold is too low and the number of samples too large (i.e. no such sample exists), it gives up after 1000 attempts and reports that it failed.
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
df = as.data.frame(df)
subset1 <- df[df$Group %in% c(1,3,5),]
subset2 <- df[df$Group %in% c(2,4,6),]
values_in_subset1 <- subset1[2:ncol(subset1)] # Drop group column
values_in_subset1 <- as.vector(t(values_in_subset1)) # Convert to single vector
values_in_subset2 <- subset2[2:ncol(subset2)] # Drop group column
values_in_subset2 <- as.vector(t(values_in_subset2)) # Convert to single vector
n_sampled <- 8
sample1 <- sample(values_in_subset1, n_sampled, replace = FALSE) # replace = FALSE is the default, spelled out for readability
sample2 <- sample(values_in_subset2, n_sampled, replace = FALSE)
percentage_overlap <- sum(sample1 %in% sample2) / n_sampled
max_percentage_overlap <- 0.25 # the threshold is an upper bound on the overlap
retries <- 1
# Resample subset 2 until the overlap is within the threshold, or give up after 1000 tries
while (percentage_overlap > max_percentage_overlap && retries < 1000) {
  retries <- retries + 1
  sample2 <- sample(values_in_subset2, n_sampled, replace = FALSE)
  percentage_overlap <- sum(sample1 %in% sample2) / n_sampled
}
# Report on number of attempts
cat(paste("Sampled", retries, "times to get no more than", max_percentage_overlap * 100, "% overlap.\n"))
# Finally, check if it worked.
if (percentage_overlap <= max_percentage_overlap) {
  cat("It's super effective! (not really though)\n")
} else {
  cat("But it failed!\n")
}

Related

tidyverse and dplyr: Conditional replacement of values in a column based on other column [duplicate]

This question already has answers here:
Can dplyr package be used for conditional mutating?
(5 answers)
Closed 2 years ago.
I want to create a column A4 from A3, subtracting 1 from A3 wherever Total == 63. What am I doing wrong here?
tb1 %>%
mutate(A4 = replace(A3, Total == 63, A3-1))
The complete code with data is here:
library(tidyverse)
tb1 <-
structure(
list(
A1 = c(16, 11, 16, 18, 20, 19, 16, 18, 20, 15,
17, 19, 19, 19, 16, 19, 16, 15, 19, 19, 16, 18, 18, 19, 19, 18,
20, 18, 19, 19, 19, 19, 17, 19, 17, 16, 18, 19, 16, 18, 17, 19,
19, 20, 17, 16, 18, 16, 15, 19, 19, 17, 20, 18, 16, 19, 19, 15,
17, 17, 19, 19, 16, 17, 18, 19, 17, 19, 17, 15, 19, 16, 17
)
, A2 = c(8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
)
, A3 = c(33, 34, 38, 36, 36, 34, 41, 36, 40, 38, 38, 41, 38, 34, 33, 36,
41, 40, 41, 38, 41, 33, 40, 38, 40, 38, 41, 41, 40, 41, 40,
38, 34, 40, 36, 41, 40, 40, 33, 38, 36, 41, 40, 40, 28, 41,
40, 41, 33, 41, 36, 36, 40, 34, 41, 41, 38, 38, 41, 38, 41,
41, 36, 40, 38, 38, 40, 41, 38, 22, 36, 34, 38
)
, Total = c(57, 53, 62, 62, 64, 61, 65, 62, 68, 61, 63, 68, 65, 61, 57, 63,
65, 63, 68, 65, 65, 59, 66, 65, 67, 64, 69, 67, 67, 68, 67,
65, 59, 67, 61, 65, 66, 67, 57, 64, 61, 68, 67, 68, 53, 65,
66, 65, 56, 68, 63, 61, 68, 60, 65, 68, 65, 61, 66, 63, 68,
68, 60, 65, 64, 65, 65, 68, 63, 45, 63, 58, 63
)
)
, class = "data.frame"
, row.names = c(NA, -73L)
)
tb1 %>%
filter(Total == 63)
#> A1 A2 A3 Total
#> 1 17 8 38 63
#> 2 19 8 36 63
#> 3 15 8 40 63
#> 4 19 8 36 63
#> 5 17 8 38 63
#> 6 17 8 38 63
#> 7 19 8 36 63
#> 8 17 8 38 63
tb2 <-
tb1 %>%
mutate(A4 = replace(A3, Total == 63, A3-1)) %>%
mutate(Total = A1 + A2 + A3)
#> Warning: Problem with `mutate()` input `A4`.
#> x number of items to replace is not a multiple of replacement length
#> ℹ Input `A4` is `replace(A3, Total == 63, A3 - 1)`.
tb2 %>%
filter(Total == 62)
#> A1 A2 A3 Total
#> 1 16 8 38 62
#> 2 18 8 36 62
#> 3 18 8 36 62

You are better off using ifelse here:
library(dplyr)
tb1 %>% mutate(A4 = ifelse(Total == 63, A3 -1, A3))
As for why replace does not work, look at its source code:
replace
function (x, list, values)
{
x[list] <- values
x
}
It subsets x by list and assigns values to the selected positions.
When you use:
tb1 %>% mutate(A4 = replace(A3, Total == 63, A3-1))
values has length length(tb1$A3), but list selects only sum(tb1$Total == 63) positions. The two lengths do not match (and one is not a multiple of the other), hence the warning "number of items to replace is not a multiple of replacement length".
If you want to make replace work, subset the replacement values with the same condition:
tb1 %>% mutate(A4 = replace(A3, Total == 63, A3[Total == 63] -1))
but, as I mentioned, it is easier to just use ifelse here.
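A small sketch of the difference on toy vectors (the values below are invented for illustration and are not taken from tb1). It shows what the mismatched replace() call actually produces, next to the two correct forms.

```r
A3    <- c(33, 38, 36, 40, 41)
Total <- c(57, 63, 62, 63, 65)
hit   <- Total == 63            # TRUE at positions 2 and 4

# Mismatched lengths: 5 replacement values for 2 selected positions.
# R warns and uses only the first 2 values (A3[1] - 1 and A3[2] - 1),
# which belong to the wrong rows.
wrong <- suppressWarnings(replace(A3, hit, A3 - 1))

# Correct: subset the replacement values with the same condition
right_replace <- replace(A3, hit, A3[hit] - 1)

# Equivalent and easier to read: ifelse() works element by element
right_ifelse <- ifelse(hit, A3 - 1, A3)

wrong           # 33 32 36 37 41
right_replace   # 33 37 36 39 41
right_ifelse    # 33 37 36 39 41
```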

Iterate over columns with NAs to create percentile variables with dplyr and data.table

I need something quite simple: to iterate over the columns of a dataset and create percentile versions of those columns. I tried with dplyr and data.table, but neither seems to do what I need. In particular, I need to exclude the NA values when creating the percentile versions of the columns.
Reproducible example below:
values<-c(19,
6,
27,
63,
50,
59,
97,
89,
NA,
9,
31,
58,
83,
2,
1,
31,
3,
1,
27,
40,
32,
42,
99,
NA,
12,
16,
23,
98,
44,
25,
13,
70,
64,
NA,
37,
75,
73,
59,
21,
3,
76,
43,
6,
96,
55,
48,
70,
90,
18,
58,
22,
19,
26,
49,
59,
94,
31,
45,
20,
8,
26,
56,
7,
11,
98,
50,
41,
38,
86,
0,
37,
NA,
40,
7,
88,
38,
41,
41,
19,
34,
21,
64,
87,
22,
54,
39,
75,
72,
91,
78)
values2<- c(98,
60,
9,
98,
NA,
88,
NA,
54,
92,
90,
NA,
83,
92,
65,
44,
NA,
98,
40,
26,
40,
54,
56,
15,
90,
15,
63,
57,
NA,
85,
69,
73,
43,
24,
27,
82,
75,
29,
98,
29,
5,
91,
88,
28,
12,
53,
NA,
2,
42,
86,
2,
78,
20,
50,
73,
77,
NA,
4,
39,
90,
NA,
29,
14,
98,
88,
77,
79,
30,
9,
74,
93,
NA,
16,
27,
16,
18,
40,
NA,
2,
66,
71,
82,
10,
62,
84,
25,
NA,
15,
12,
85,
50)
groups<-c(1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2)
df<-as.data.frame(cbind(groups,values,values2))
library(dplyr)
for (i in c("values","values2")) {
df<-df %>%
group_by(groups) %>%
mutate(!!sym(paste( i,"_percentile", sep="")) := percent_rank(na.omit(i)))
}
for (i in c("values","values2")) {
df<-df %>%
group_by(groups) %>%
mutate(!!sym(paste( i,"_percentile", sep="")) := rank(i)/length(i) )
}
library(data.table)
df<- as.data.table(df)
for (i in c("values","values2")) {
df[, paste(i,"_percentile",sep="") := rank(get(i))/length( get(i)), by = groups ]
}
for (i in c("values","values2")) {
df[!is.na(i), paste(i,"_percentile",sep="") := rank(get(i))/length( get(i)), by = groups ]
}

One option is mutate_at. After grouping by 'groups', use mutate_at to loop over the columns whose names start with 'values', and replace the non-NA values with the percent_rank of the non-NA elements.
library(dplyr)
df %>%
group_by(groups) %>%
mutate_at(vars(starts_with('values')),
list(percentile = ~ replace(., !is.na(.), percent_rank(.[!is.na(.)]))))
Or with data.table
library(data.table)
nm1 <- paste0(names(df)[2:3], "_percentile") # paste0() avoids a space before the suffix
setDT(df)[, (nm1) := lapply(.SD, function(x) replace(x, !is.na(x),
    frank(x[!is.na(x)]) / sum(!is.na(x)))), .SDcols = 2:3, by = groups]

My tidyverse answer has the same structure as @akrun's: mutate_at to add multiple columns, and starts_with to select them. A few things worth pointing out with this more minimal example:
The percent_rank function already skips NAs when it calculates, so you don't have to do extra work to filter them out.
There is one degenerate case where a group has only one actual measurement (here, values1 in group "b"). percent_rank can return NaN there because it rescales min_rank. Inside the direct mutate_at call, that issue seems to be avoided. (It's unclear what value should be assigned in your case.)
There's another sort-of degenerate case when there's a tie. In group "a", values2 has a tie for first place, and the tied percent_rank values are accordingly not 1.0.
library(tidyverse)
df <- tribble(
~groups, ~values1, ~values2,
"a", 1, 10,
"a", 2, 10,
"a", NA, 8,
"a", 3, 9,
"a", 4, 7,
"b", NA, 10,
"b", 2, NA,
"b", NA, 8
)
df %>%
group_by(groups) %>%
mutate_at(
vars(starts_with("values")),
list(percentile = ~ percent_rank(.)))
#> # A tibble: 8 x 5
#> # Groups: groups [2]
#> groups values1 values2 values1_percentile values2_percentile
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 10 0 0.75
#> 2 a 2 10 0.333 0.75
#> 3 a NA 8 NA 0.25
#> 4 a 3 9 0.667 0.5
#> 5 a 4 7 1 0
#> 6 b NA 10 NA 1
#> 7 b 2 NA 0 NA
#> 8 b NA 8 NA 0
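To see where the group "a" numbers above come from, here is a base-R sketch of the calculation. It assumes percent_rank() is defined as (min_rank - 1) / (number of non-missing values - 1), which is how I read the dplyr documentation; pct_rank() is a made-up name for this illustration, not a dplyr function.

```r
# Hypothetical stand-in for dplyr::percent_rank(), under the assumption above
pct_rank <- function(x) {
  # ties.method = "min" mimics min_rank(); na.last = "keep" leaves NAs as NA
  r <- rank(x, ties.method = "min", na.last = "keep")
  (r - 1) / (sum(!is.na(x)) - 1)
}

values1_a <- c(1, 2, NA, 3, 4)     # group "a", values1
values2_a <- c(10, 10, 8, 9, 7)    # group "a", values2 (tie at the top)

pct_rank(values1_a)   # 0, 1/3, NA, 2/3, 1
pct_rank(values2_a)   # 0.75, 0.75, 0.25, 0.5, 0

# Degenerate case: a single non-missing value gives 0 / 0
pct_rank(c(NA, 2, NA))
```

With this definition the single-value case evaluates to NaN, which is the degenerate behaviour mentioned in the notes above.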

creating named vector from a csv file did not work

Creating named vector where names are associated to GO id from a csv file did not work.
> head(read.delim("~/GOmapping.tsv", sep = '\t'))
V1 V14
1 sp0000005 GO:0003723
2 sp0000006 GO:0016021
3 sp0000007 GO:0003700,GO:0006355,GO:0043565
4 sp0000016 GO:0046983
5 sp0000017 GO:0004672,GO:0005524,GO:0006468
6 sp0000022 GO:0003677,GO:0046983
> head(read.delim("~/GOmapping.tsv", sep = '\t'))[1]
V1
1 sp0000005
2 sp0000006
3 sp0000007
4 sp0000016
5 sp0000017
6 sp0000022
> head(read.delim("~/GOmapping.tsv", sep = '\t'))[2]
V14
1 GO:0003723
2 GO:0016021
3 GO:0003700,GO:0006355,GO:0043565
4 GO:0046983
5 GO:0004672,GO:0005524,GO:0006468
6 GO:0003677,GO:0046983
> geneID2GO <- read.delim("~/GOmapping.tsv", sep = '\t'))[2]
> geneID2GO <- read.delim("~/GOmapping.tsv", sep = '\t')[2]
> names(geneID2GO) <- read.delim("~/GOmapping.tsv", sep = '\t')[1]
> head(geneID2GO)
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 57, 58, 59, 60, 6 ...
1 GO:0003723
2 GO:0016021
3 GO:0003700,GO:0006355,GO:0043565
4 GO:0046983
5 GO:0004672,GO:0005524,GO:0006468
6 GO:0003677,GO:0046983
What did I miss?
Thank you in advance.

If you want a vector as the result, you could coerce both the values (column 2) and the names (column 1) to character.
data <- read.delim("~/GOmapping.tsv", sep = '\t')
geneID2GO <- as.character(data[,2])
names(geneID2GO) <- as.character(data[,1])
head(geneID2GO)
sp0000005 sp0000006 sp0000007
"GO:0003723" "GO:0016021" "GO:0003700,GO:0006355,GO:0043565"
sp0000016
"GO:0046983"
Alternatively, you can display the result as follows:
cbind(geneID2GO)
geneID2GO
sp0000005 "GO:0003723"
sp0000006 "GO:0016021"
sp0000007 "GO:0003700,GO:0006355,GO:0043565"
sp0000016 "GO:0046983"
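The underlying issue in the question is that single-bracket indexing such as [2] returns a one-column data frame, so names() ends up setting its column name instead of element names. The sketch below illustrates this on three rows copied from the question, read from an inline string instead of ~/GOmapping.tsv. The final strsplit() step is an assumption about what a gene-to-GO mapping usually needs (one vector of GO IDs per gene), not something stated in the question.

```r
tsv <- paste(
  "sp0000005\tGO:0003723",
  "sp0000006\tGO:0016021",
  "sp0000007\tGO:0003700,GO:0006355,GO:0043565",
  sep = "\n"
)
data <- read.delim(text = tsv, header = FALSE, stringsAsFactors = FALSE)

# [2] keeps the data frame wrapper; [[2]] (or data[, 2]) extracts the column itself
is.data.frame(data[2])     # TRUE
is.character(data[[2]])    # TRUE

# Named character vector: GO strings named by gene ID
geneID2GO <- setNames(data[[2]], data[[1]])

# Optional: split the comma-separated IDs into one character vector per gene
geneID2GO_list <- strsplit(geneID2GO, ",")
geneID2GO_list[["sp0000007"]]
```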

Trying to Repeat, but data is not a multiple

I am trying to label a data matrix with conditions. In my experiment I had 3 tubes: the first two were repeated 7 times and the third tube 6 times. How can I label the data so that the "missing" repeat is ignored?
dm$Strain<-dm$variable
dm$Strain<-rep(c("446-1", "446-2", "446-3"), each.out=193)
dm$Strain<-factor(dm$Strain)
levels(dm$Strain)
Error in $<-.data.frame(*tmp*, "Strain", value = c("446-1", "446-2", :
replacement has 3 rows, data has 19300
Data Setup in Wells:
1) Control = 1, 16, 31, 46, 61, 76, 91
2) LI 446-1 tube = 2, 17, 32, 47, 62, 77, 92
3) LI 446-1 10^7 = 3, 18, 33, 48, 63, 78, 93
4) LI 446-1 10^6 = 4, 19, 34, 49, 64, 79, 94
5) LI 446-1 10^5 = 5, 20, 35, 50, 65, 80, 95
6) Control = 6, 21, 36, 51, 66, 81, 96
7) LI-446-2 tube = 7, 22, 37, 52, 67, 82, 97
8) LI-446-2 10^7 = 8, 23, 38, 53, 68, 83, 98
9) LI-446-2 10^6 = 9, 24, 39, 54, 69, 84, 99
10) LI-446-2 10^5 = 10, 25, 40 ,55, 70, 85, 100
11) Control = 11, 26, 41, 56, 71, 86
12) LI-446-3 tube = 12, 27, 42, 57, 72, 87
13) LI-446-3 10^7 = 13, 28, 43, 58, 73, 88
14) LI-446-3 10^6 = 14, 29, 44, 59, 74, 89
15) LI-446-3 10^5 = 15, 30, 45, 60, 75, 90
I have 19300 rows of data, where rows 1:193 correspond to Well 1 at 15 min intervals, rows 194:386 to Well 2 at 15 min intervals, and so on up to Well 100. However, the 446-3 conditions (11-15 above) were repeated 6 times, while 446-1 and 446-2 were repeated 7 times.
str(dm)
'data.frame': 19300 obs. of 4 variables:
$ Time..mins.: int 15 30 45 60 75 90 105 120 135 150 ...
$ variable : Factor w/ 100 levels "Well_1","Well_2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0.439 0.204 0.191 0.187 0.185 0.19 0.187 0.19 0.188 0.191 ...
$ Media : Factor w/ 2 levels "BHI","BHI_salt": 1 1 1 1 1 1 1 1 1 1 ...
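One possible way to label the rows without relying on rep() is to derive the strain from the well number itself, since the layout above repeats every 15 wells. This is only a sketch under that assumption: the toy dm below mimics just the variable column of the real data, and well_no, position and strain_labels are names invented for this illustration.

```r
# Toy stand-in for the real data: 100 wells x 193 time points
well_levels <- paste0("Well_", 1:100)
dm <- data.frame(variable = factor(rep(well_levels, each = 193), levels = well_levels))

# Well number, and its position (1..15) within the repeating layout
well_no  <- as.integer(sub("Well_", "", as.character(dm$variable)))
position <- (well_no - 1) %% 15 + 1

# Positions 1-5 belong to 446-1, 6-10 to 446-2, 11-15 to 446-3
strain_labels <- c("446-1", "446-2", "446-3")
dm$Strain <- factor(strain_labels[(position - 1) %/% 5 + 1], levels = strain_labels)

# Number of wells per strain: the shorter third series needs no special handling
table(dm$Strain) / 193
```

Because the label is computed per row, the sixth-only repeat of 446-3 simply yields fewer rows for that strain.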

R - Keep reading line if 7 or more numbers are => 10

I have a file foo.txt that looks like this:
7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5, 13, 34, 24, 17, 50, 30, 42, 35, 29, 27, 52, 35, 44, 52, 36, 39, 25, 40, 50, 52, 40, 2, 52, 52, 31, 35, 30, 19, 32, 46, 50, 43, 36, 15, 21, 16, 36, 25, 7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6
I want to read the numbers in sets of 15, moving to the right one number at the time:
7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5
then
3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5, 22
and so on.
If 7 or more of those 15 numbers are >= 10, keep them in a growing object that ends when the condition is no longer met. So the first set to keep would be
3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5, 13
because 7 out of those 15 numbers are >= 10 (those numbers are 22, 18, 14, 23, 16, 18 and 13).
The output file would look like this:
3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5, 13, 34, 24, 17, 50, 30, 42, 35, 29, 27, 52, 35, 44, 52, 36, 39, 25, 40, 50, 52, 40, 2, 52, 52, 31, 35, 30, 19, 32, 46, 50, 43, 36, 15, 21, 16, 36, 25, 7, 3, 5, 7, 3, 3, 3, 3
So far I'm stuck at getting sets of 15 digits but I don't know how to make the condition "7 or more must be => 10"
qual <- readLines("foo.txt", 1)
separados <- unlist(strsplit(qual, ", "))
for (i in 1:length(qual)) {
separados[(i):(i + 14)] -> numbers
I don't mind the language as long as it does the work

I've added two ='s to Vlo's solutions and made this for you. Does this answer your question?
foo.txt <- c(7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5,
13, 34, 24, 17, 50, 30, 42, 35, 29, 27, 52, 35, 44, 52, 36, 39, 25, 40,
50, 52, 40, 2, 52, 52, 31, 35, 30, 19, 32, 46, 50, 43, 36, 15, 21, 16,
36, 25, 7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6)
# install.packages(c("zoo"), dependencies = TRUE)
require(zoo)
bar <- rollapply(foo.txt, 15, function(x) sum(x >= 10 ) >= 7)
(product <- foo.txt[bar])
[1] 3 3 3 6 7 5 5 22 18 14 23 16 18 5 13 34 24 17 50 30 42 35 29 27
[25] 52 35 44 52 36 39 25 40 50 52 40 2 52 52 31 35 30 19 32 46 50 43 3 3
[49] 3 3 3 6

I would do it in Python (you said you don't mind the language):
array = []
with open("foo.txt", "r") as f:
    for line in f:
        for num in line.strip().split(', '):
            array.append(int(num))

result = []
growing = False
while len(array) >= 15:
    window = array[:15]
    if sum(1 for x in window if x >= 10) >= 7:
        if growing:
            # The previous window qualified too: only its last number is new
            result.append(window[14])
        else:
            # First qualifying window of a run: keep all 15 numbers
            result.extend(window)
            growing = True
    else:
        growing = False
    del array[0]  # slide the window one number to the right
print(str(result)[1:-1])
Short explanation: the first loop simply reads the lines in the file, strips the end of line, splits on ", " and appends each number to array.
The while loop checks the first 15 numbers in array; if at least 7 of them are >= 10, it appends either all 15 numbers or just the last one (depending on whether the previous window also qualified) to result. At the end of each iteration, it removes the first number in array so that the loop can continue with the next 15 numbers.
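The same sliding-window idea can also be sketched in base R, without zoo. The sketch keeps the first run of consecutive qualifying windows and returns every number those windows cover. The names k, threshold, needed and kept are illustrative, and the data are the numbers from foo.txt in the question.

```r
x <- c(7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5,
       13, 34, 24, 17, 50, 30, 42, 35, 29, 27, 52, 35, 44, 52, 36, 39, 25, 40,
       50, 52, 40, 2, 52, 52, 31, 35, 30, 19, 32, 46, 50, 43, 36, 15, 21, 16,
       36, 25, 7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6)

k         <- 15   # window width
threshold <- 10   # a number "counts" if it is >= threshold
needed    <- 7    # a window qualifies if at least this many numbers count

# For every window start s, does the window x[s:(s + k - 1)] qualify?
n_win <- length(x) - k + 1
ok <- sapply(seq_len(n_win), function(s) sum(x[s:(s + k - 1)] >= threshold) >= needed)

# Locate the first run of consecutive qualifying windows
runs   <- rle(ok)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
first  <- which(runs$values)[1]

# Keep everything from the first qualifying window's start
# to the last qualifying window's end
kept <- x[starts[first]:(ends[first] + k - 1)]
cat(kept, sep = ", ")
```

On this input the kept stretch starts at the ninth number and is 60 numbers long, which matches the expected output given in the question. If no window qualifies, first is NA and the indexing step would need a guard.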
