I have a dataframe in which each company ID appears anywhere from once to over 30 times, in different rows. I want to add a new column "di_Flex" and fill it with a value that depends on how often the same company ID appears in the ID column:
If it appears twice, set "di_Flex" to 6,
if it appears 3 times, 8,
if it appears 4 times, 10,
if it appears 5 times, 12.8,
if it appears 6 times, 14.67,
if it appears 7 or more times, 16.
Here is the ID column:
c(0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 7, 7, 8, 9, 9, 9, 10, 10, 11, 11, 12, 12, 13, 14,
15, 16, 17, 17, 18, 18, 19, 20, 21, 22, 23, 23, 23, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25,
25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27,
28, 29, 30, 31, 31, 32, 32, 32, 33, 33, 33, 34, 34, 34, 35, 36,
36, 37, 38, 38, 38, 38, 38, 38, 39, 40, 41, 41, 41, 42, 42, 42,
43, 43, 43, 44, 45, 45, 46, 46, 46, 47, 48, 49, 50, 50, 51, 53,
54, 54, 54, 54, 55, 57, 57, 57, 59, 59, 59, 59, 60, 60, 60, 60,
61, 61, 62, 62, 62, 63, 63, 64, 64, 64, 64, 65, 65, 66, 66, 66,
66, 66, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)
Thank you for your help!
Assuming your data is called df with a column value:
library(tidyverse)
left_join(df,
          df %>%
            group_by(value) %>%
            tally(),
          by = "value") %>%
  mutate(di_Flex = case_when(n == 2 ~ 6,
                             n == 3 ~ 8,
                             n == 4 ~ 10,
                             n == 5 ~ 12.8,
                             n == 6 ~ 14.67,
                             n >= 7 ~ 16)) %>%
  select(-n)
This gives us (first 20 rows):
   value di_Flex
1      0    12.8
2      0    12.8
3      0    12.8
4      0    12.8
5      0    12.8
6      1      NA
7      2      NA
8      3      NA
9      4      NA
10     5     8.0
11     5     8.0
12     5     8.0
13     6    16.0
14     6    16.0
15     6    16.0
16     6    16.0
17     6    16.0
18     6    16.0
19     6    16.0
20     6    16.0
Data:
df <- data.frame(value = c(0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 7, 7, 8, 9, 9, 9, 10, 10, 11, 11, 12, 12, 13, 14,
15, 16, 17, 17, 18, 18, 19, 20, 21, 22, 23, 23, 23, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25,
25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,
25, 25, 26, 26, 26, 26, 26, 26, 26, 26, 26, 27, 27, 27, 27, 27,
28, 29, 30, 31, 31, 32, 32, 32, 33, 33, 33, 34, 34, 34, 35, 36,
36, 37, 38, 38, 38, 38, 38, 38, 39, 40, 41, 41, 41, 42, 42, 42,
43, 43, 43, 44, 45, 45, 46, 46, 46, 47, 48, 49, 50, 50, 51, 53,
54, 54, 54, 54, 55, 57, 57, 57, 59, 59, 59, 59, 60, 60, 60, 60,
61, 61, 62, 62, 62, 63, 63, 64, 64, 64, 64, 65, 65, 66, 66, 66,
66, 66, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA))
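One detail worth checking in the approach above: as far as I can tell, group_by() treats NA as a group of its own, so the 35 NA rows would get n = 35 and therefore di_Flex = 16. If that is not wanted, here is a minimal base R sketch that leaves NA IDs as NA. It uses a small made-up vector rather than the full data, and the names counts, n and lookup are only illustrative.

```r
# Small illustrative ID vector (NOT the full data from the question)
value <- c(0, 0, 0, 0, 0, 1, 5, 5, 5, 7, 7, NA, NA)

# Count each non-missing ID; table() drops NA by default
counts <- table(value)

# Look up the count for every row; NA IDs find no match and stay NA
n <- as.vector(counts)[match(value, names(counts))]

# Position i holds the value for a count of i; counts of 7 or more share slot 7
lookup <- c(NA, 6, 8, 10, 12.8, 14.67, 16)
di_Flex <- lookup[pmin(n, 7)]

data.frame(value, di_Flex)
```

Here the five 0s get 12.8, the three 5s get 8, the two 7s get 6, and both the singleton and the NA rows get NA.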
I have a data frame in which each row represents a community. The first column is the group that the community falls into (6 groups in total), and the remaining 8 columns are the IDs of the members of the community.
I would like to pick one community (row) from each of groups 1, 3, and 5 such that the three picked communities share no members. Once I have that, I would like to pick one community from each of groups 2, 4, and 6 such that there is no more than 25% overlap among the 6 selected communities.
Here is an example dataset:
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
Based on the criteria I mentioned above, the following could be pulled out:
Group 1: 125, 8, 40, 127, 19, 33, 29, 3
Group 3: 11, 25, 126, 22, 56, 4, 6, 52
Group 5: 5, 63, 18, 48, 37, 32, 43, 1
Group 2: 25, 37, 8, 38, 40, 124, 32, 56
Group 4: 125, 15, 29, 4, 48, 5, 128, 11
Group 6: 34, 23, 33, 32, 63, 22, 19, 56
I believe this might be helpful (please let me know if not!).
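The first step is to subset your data to Groups 1, 3, and 5. Then, using transpose from purrr to turn each row into a list, split to arrange those rows by Group, and cross to combine them, we get every way of selecting one row from each group. Note that cbind returns a matrix, so it has to be converted to a data frame before $ can be used, and that bind_rows comes from dplyr.
library(purrr)
library(dplyr)
df <- as.data.frame(df)
grp_135 <- df[df$Group %in% c(1, 3, 5), ]
all_combn_135 <- lapply(cross(split(transpose(grp_135), grp_135$Group)), bind_rows)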
Checking the first element to see what we have:
R> all_combn_135[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 29 4 3 63 125 40 32 38
3 5 5 63 18 48 37 32 43 1
Next, we can check for overlap by counting duplicates. In this case, I just unlist the three rows, use table to get the frequency of each value, subtract 1 from each count (since we only want the duplicates), and sum up.
combn_ovlp_135 <- lapply(all_combn_135, function(x) {
sum(table(unlist(x[-1])) - 1)
})
The ones without overlap can be obtained by:
no_ovlp <- all_combn_135[combn_ovlp_135 == 0]
no_ovlp
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 11 25 126 22 56 4 6 52
3 5 5 63 18 48 37 32 43 1
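In recent purrr releases cross() is, as far as I know, marked as deprecated, so here is a sketch of the same enumeration in base R using expand.grid on row indices. The data frame grp below contains only the Group 1, 3 and 5 rows of the question's data; the names grp, members, combos and n_dup are my own and not part of the original answer.

```r
# Rows of the question's data for Groups 1, 3 and 5 only (one row per community)
grp <- data.frame(
  Group  = rep(c(1, 3, 5), each = 4),
  Isol_1 = c(125, 25, 1, 126, 29, 15, 11, 18, 5, 19, 11, 4),
  Isol_2 = c(8, 6, 56, 40, 4, 34, 25, 15, 63, 18, 22, 125),
  Isol_3 = c(40, 34, 125, 63, 3, 125, 126, 37, 18, 40, 23, 25),
  Isol_4 = c(127, 128, 8, 6, 63, 43, 22, 34, 48, 22, 126, 23),
  Isol_5 = c(19, 4, 43, 125, 125, 23, 56, 43, 37, 63, 32, 63),
  Isol_6 = c(33, 1, 128, 52, 40, 63, 4, 38, 32, 1, 19, 38),
  Isol_7 = c(29, 63, 126, 128, 32, 33, 6, 6, 43, 33, 40, 11),
  Isol_8 = c(3, 40, 34, 4, 38, 38, 52, 32, 1, 128, 37, 15)
)
members <- as.matrix(grp[-1])

# One row index per group, in every possible combination (4 * 4 * 4 = 64)
combos <- expand.grid(split(seq_len(nrow(grp)), grp$Group))

# Number of repeated IDs when the three chosen rows are pooled
n_dup <- apply(combos, 1, function(idx) {
  vals <- as.vector(members[idx, ])
  length(vals) - length(unique(vals))
})

# Row indices (into grp) of the combinations with no shared member
combos[n_dup == 0, ]
```

Working with row indices rather than the rows themselves keeps the combinations table small; the selected rows can be recovered with grp[unlist(combos[i, ]), ] for any combination i.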
For the next part, do something similar (this can be broken out into a generalized function), except that when checking for overlap, the candidate rows are combined with the first element of no_ovlp from the previous step:
grp_246 <- df[df$Group %in% c(2, 4, 6), ]
all_combn_246 <- lapply(cross(split(transpose(grp_246), grp_246$Group)), bind_rows)
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) - 1) / ((ncol(df) - 1) * 6)
})
It is not entirely clear how you want to calculate overlap for this part and compare it with 25%. I counted duplicates and then divided by the total number of values, that is, the number of columns (8, not counting Group) multiplied by 6 (rows). To see which combinations of Groups 2, 4, and 6 could be combined with no_ovlp, you can try the following:
all_combn_246[combn_ovlp_246 < .25]
In my case, I believe none of the combinations met this criterion, although the first with 37.5% overlap was the minimum:
R> all_combn_246[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 25 37 8 38 40 124 32 56
2 4 125 15 29 4 48 5 128 11
3 6 34 23 33 32 63 22 19 56
What was unclear is how to count duplicates. For example, how much overlap is there in x <- c(1, 2, 3, 3, 3)?
This could be two duplicates (two extra 3's):
R> sum(table(x) - 1)
[1] 2
Or you could count number of values that have any duplicates (just the number 3 is duplicated):
R> sum(table(x) > 1)
[1] 1
If it is the latter, you could try:
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) > 1) / ((ncol(df) - 1) * 6)
})
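Since the two definitions differ only in how the frequency table is summarised, they can be folded into one helper. This is just a sketch: the function name overlap_fraction and its distinct argument are hypothetical, and dividing by the total number of values is the same assumption about the 25% rule as above.

```r
# Fraction of pooled values that are duplicates.
# distinct = FALSE: count every extra occurrence (sum(table(x) - 1))
# distinct = TRUE:  count each duplicated value once (sum(table(x) > 1))
overlap_fraction <- function(..., distinct = FALSE) {
  values <- unlist(list(...))
  counts <- table(values)
  dup <- if (distinct) sum(counts > 1) else sum(counts - 1)
  dup / length(values)
}

x <- c(1, 2, 3, 3, 3)
overlap_fraction(x)                   # 2 duplicates out of 5 values
overlap_fraction(x, distinct = TRUE)  # 1 duplicated value out of 5 values
```

Several vectors can be passed at once, for example the members of a candidate combination together with the rows already selected.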
By shamelessly stealing Ben's use of cross(), I have this approach that I personally find easier to read:
# Returns the number of overlapping elements
overlap <- function(xx){
length(unlist(xx)) - length(unique(unlist(xx)))
}
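library(tidyverse)
df_135 <- df %>%
  as_tibble() %>%
  filter(Group %in% c(1, 3, 5)) %>%
  group_by(Group) %>%
  # label each community as g<Group>_<row within group>, matching the output below
  mutate(Community = paste0("g", Group, "_", row_number())) %>%
  nest(Members = starts_with("Isol_")) %>%
  mutate(Members = map(Members, as.integer))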
df_135
# A tibble: 12 x 3
# Groups: Group [3]
# Group Community Members
# <dbl> <chr> <list>
# 1 1 g1_1 <int [8]>
# 2 1 g1_2 <int [8]>
# 3 1 g1_3 <int [8]>
# 4 1 g1_4 <int [8]>
# 5 3 g3_1 <int [8]>
# 6 3 g3_2 <int [8]>
# 7 3 g3_3 <int [8]>
# 8 3 g3_4 <int [8]>
# 9 5 g5_1 <int [8]>
#10 5 g5_2 <int [8]>
#11 5 g5_3 <int [8]>
#12 5 g5_4 <int [8]>
# Compute all combinations across groups
all_combns <- cross(split(df_135$Members, df_135$Group))
# select the combinations with the desired overlap
all_combns[map_int(all_combns, overlap) == 0]
# [[1]]
# [[1]]$`1`
# [1] 125 8 40 127 19 33 29 3
#
# [[1]]$`3`
# [1] 11 25 126 22 56 4 6 52
#
# [[1]]$`5`
# [1] 5 63 18 48 37 32 43 1
Here's a plain R solution. It's not the most efficient one, but it's very straightforward and therefore easy to follow.
The code below collects all the values in set 1 (groups 1, 3, 5) and set 2 (groups 2, 4, 6), and samples n isolates from each pool. It then tests whether the overlap is within the threshold and resamples set 2 if necessary. In the case of your request, it only needs to resample once or twice, but if your threshold is lower (e.g. 0.05), it may resample up to 50 times before it gets it right. In fact, if your threshold is too low and your number of samples too large (i.e. it is impossible to make this sample), it will warn you that it failed.
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
df = as.data.frame(df)
subset1 <- df[df$Group %in% c(1,3,5),]
subset2 <- df[df$Group %in% c(2,4,6),]
values_in_subset1 <- subset1[2:ncol(subset1)] # Drop group column
values_in_subset1 <- as.vector(t(values_in_subset1)) # Convert to single vector
values_in_subset2 <- subset2[2:ncol(subset2)] # Drop group column
values_in_subset2 <- as.vector(t(values_in_subset2)) # Convert to single vector
n_sampled <- 8
sample1 <- sample(values_in_subset1, n_sampled, replace=F) #Replace=F is default, added here for readability
sample2 <- sample(values_in_subset2, n_sampled, replace=F) #Replace=F is default, added here for readability
percentage_overlap <- sum(sample1 %in% sample2)/n_sampled
min_percentage_overlap <- 0.25
retries <- 1
# Retry until it gets it right
while(percentage_overlap > min_percentage_overlap && retries < 1000)
{
retries <- retries + 1
sample2 <- sample(values_in_subset2, n_sampled, replace=F) #Replace=F is default, added here for readability
percentage_overlap <- sum(sample1 %in% sample2)/n_sampled
}
# Report on number of attempts
cat(paste("Sampled", retries, "times to make sure there was less than", min_percentage_overlap*100,"% overlap."))
# Finally, check if it worked.
if(percentage_overlap <= min_percentage_overlap){
cat("It's super effective! (not really though)")
} else {
cat("But it failed!")
}
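The retry loop above can also be wrapped in a function so that the threshold and the retry limit become arguments. The following is only a sketch of that idea: the function name sample_with_max_overlap and its argument names are made up for illustration, and the pools in the usage lines are toy vectors, not the question's data.

```r
# Draw n values from each pool; redraw the second sample until the share of
# sample1 that also appears in sample2 is at most max_overlap.
sample_with_max_overlap <- function(pool1, pool2, n,
                                    max_overlap = 0.25, max_tries = 1000) {
  sample1 <- sample(pool1, n)
  for (tries in seq_len(max_tries)) {
    sample2 <- sample(pool2, n)
    overlap <- sum(sample1 %in% sample2) / n
    if (overlap <= max_overlap) {
      return(list(sample1 = sample1, sample2 = sample2,
                  overlap = overlap, tries = tries))
    }
  }
  warning("No sample met the overlap threshold within max_tries attempts")
  NULL
}

set.seed(1)  # only to make the toy run repeatable
res_overlapping <- sample_with_max_overlap(1:20, 11:30, n = 8)
res_disjoint <- sample_with_max_overlap(1:10, 11:20, n = 8)
```

Returning NULL with a warning on failure mirrors the "But it failed!" branch of the script, while letting the caller test the result with is.null().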
I want to create a column A4 that equals A3, except that it should be A3 - 1 where Total == 63. What am I doing wrong here?
tb1 %>%
mutate(A4 = replace(A3, Total == 63, A3-1))
The complete code with data is here
library(tidyverse)
tb1 <-
structure(
list(
A1 = c(16, 11, 16, 18, 20, 19, 16, 18, 20, 15,
17, 19, 19, 19, 16, 19, 16, 15, 19, 19, 16, 18, 18, 19, 19, 18,
20, 18, 19, 19, 19, 19, 17, 19, 17, 16, 18, 19, 16, 18, 17, 19,
19, 20, 17, 16, 18, 16, 15, 19, 19, 17, 20, 18, 16, 19, 19, 15,
17, 17, 19, 19, 16, 17, 18, 19, 17, 19, 17, 15, 19, 16, 17
)
, A2 = c(8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
)
, A3 = c(33, 34, 38, 36, 36, 34, 41, 36, 40, 38, 38, 41, 38, 34, 33, 36,
41, 40, 41, 38, 41, 33, 40, 38, 40, 38, 41, 41, 40, 41, 40,
38, 34, 40, 36, 41, 40, 40, 33, 38, 36, 41, 40, 40, 28, 41,
40, 41, 33, 41, 36, 36, 40, 34, 41, 41, 38, 38, 41, 38, 41,
41, 36, 40, 38, 38, 40, 41, 38, 22, 36, 34, 38
)
, Total = c(57, 53, 62, 62, 64, 61, 65, 62, 68, 61, 63, 68, 65, 61, 57, 63,
65, 63, 68, 65, 65, 59, 66, 65, 67, 64, 69, 67, 67, 68, 67,
65, 59, 67, 61, 65, 66, 67, 57, 64, 61, 68, 67, 68, 53, 65,
66, 65, 56, 68, 63, 61, 68, 60, 65, 68, 65, 61, 66, 63, 68,
68, 60, 65, 64, 65, 65, 68, 63, 45, 63, 58, 63
)
)
, class = "data.frame"
, row.names = c(NA, -73L)
)
tb1 %>%
filter(Total == 63)
#> A1 A2 A3 Total
#> 1 17 8 38 63
#> 2 19 8 36 63
#> 3 15 8 40 63
#> 4 19 8 36 63
#> 5 17 8 38 63
#> 6 17 8 38 63
#> 7 19 8 36 63
#> 8 17 8 38 63
tb2 <-
tb1 %>%
mutate(A4 = replace(A3, Total == 63, A3-1)) %>%
mutate(Total = A1 + A2 + A3)
#> Warning: Problem with `mutate()` input `A4`.
#> x number of items to replace is not a multiple of replacement length
#> ℹ Input `A4` is `replace(A3, Total == 63, A3 - 1)`.
tb2 %>%
filter(Total == 62)
#> A1 A2 A3 Total
#> 1 16 8 38 62
#> 2 18 8 36 62
#> 3 18 8 36 62
You are better off using ifelse here:
library(dplyr)
tb1 %>% mutate(A4 = ifelse(Total == 63, A3 -1, A3))
As for why replace does not work, check the source code of replace:
replace
function (x, list, values)
{
x[list] <- values
x
}
It subsets x by list and assigns values to the selected positions.
When you use:
tb1 %>% mutate(A4 = replace(A3, Total == 63, A3-1))
values has length length(tb1$A3) (73), but list selects only sum(tb1$Total == 63) positions (8). The lengths do not match, hence the warning that the number of items to replace is not a multiple of replacement length: R fills the 8 selected positions with the first 8 elements of values, which are not the ones belonging to the selected rows.
If you want to make replace work you can try :
tb1 %>% mutate(A4 = replace(A3, Total == 63, A3[Total == 63] -1))
but again as I mentioned it is easier to just use ifelse here.
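A tiny worked example may make the difference easier to see. The vectors below are made up for illustration (four rows instead of 73), and the names wrong, fixed and via_ifelse are mine.

```r
A3    <- c(38, 36, 40, 33)
Total <- c(63, 62, 63, 57)

# Two positions are selected but four replacement values are supplied, so R
# warns and uses only the first two (37 and 35): position 3 gets the wrong value.
wrong <- suppressWarnings(replace(A3, Total == 63, A3 - 1))

# Subsetting the replacement values to the same rows lines everything up.
fixed <- replace(A3, Total == 63, A3[Total == 63] - 1)

# ifelse() works element by element, so no subsetting is needed.
via_ifelse <- ifelse(Total == 63, A3 - 1, A3)
```

With these inputs wrong is 37 36 35 33, whereas fixed and via_ifelse are both 37 36 39 33.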
data.df
colA
2 AD
3 KF
4 GH
I want to split this column into two columns
colA ColB
2 AD
3 KF
4 GH
Here's my code:
library(dplyr)
X1 <- data.df
ca <- as.data.frame(X1) %>% separate(X1,col=colA, into = paste("colA","colB"))
Error: Values not split into 1 pieces at 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64
What's wrong with my code?
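When you pipe commands, the data.frame is passed as the first argument to each function, so it should not be repeated inside separate. Also, paste("colA","colB") produces the single string "colA colB", which is why separate complains that the values were not split into 1 piece; pass the two names as a vector instead. Try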
library(tidyr)
dat %>% separate(colA, c("colA", "colB"))
# colA colB
# 1 2 AD
# 2 3 KF
# 3 4 GH
Data
dat <- structure(list(colA = structure(1:3, .Label = c("2 AD", "3 KF",
"4 GH"), class = "factor")), .Names = "colA", row.names = c(NA,
-3L), class = "data.frame")
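To make the point about the into argument concrete, this short base R sketch compares paste() with c() and then splits the same strings without any packages. The vector colA simply mirrors the sample data; strsplit is used here as a plain stand-in and is not meant to describe what separate does internally.

```r
# paste() glues its arguments into ONE string; c() keeps them as two names
one_name  <- paste("colA", "colB")
two_names <- c("colA", "colB")
length(one_name)
length(two_names)

# Splitting on the space by hand
colA  <- c("2 AD", "3 KF", "4 GH")
parts <- do.call(rbind, strsplit(colA, " ", fixed = TRUE))
out   <- data.frame(colA = parts[, 1], colB = parts[, 2],
                    stringsAsFactors = FALSE)
out
```

Note that both columns of out are character here; as.integer(out$colA) would be needed to get numbers back.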
We could do this using read.table from base R. If the 'colA' in the initial dataset is factor class, convert to character, and use read.table. We can specify the column names with col.names argument.
read.table(text=as.character(dat$colA), sep='',
col.names=paste0('col', c('A', 'B')) )
# colA colB
#1 2 AD
#2 3 KF
#3 4 GH
Another option is cSplit from splitstackshape. We specify the column to be split in the splitCols and the sep. The direction is 'wide' by default.
library(splitstackshape)
cSplit(dat, 'colA', ' ')
NOTE: "dat" is taken from @nongkrong's post
I have two distinct vectors from which I've indexed every possible combination of perfect matches:
starts <- c(54, 54, 18, 20, 22, 22, 33, 33, 33, 37, 42, 44, 44, 51, 11, 17, 19, 19, 19, 19, 22, 23, 23, 24, 24)
ends <- c(22, 14, 14, 14, 14, 14, 14, 14, 14, 24, 24, 25, 25, 25, 25, 26, 26, 29, 30, 31, 32, 33, 33, 33, 33)
which(outer(starts, ends, "=="), arr.ind=TRUE)
Now, instead of trying to find exact matches, I'd like to find combinations of components that fall within a certain range of each other: say +/- 5. I've made a range (-5:5) and tried introducing it as a function in place of "==", but it hasn't really worked out.
Thank you very much.
You can do this by writing a small helper function that does the comparison:
cmp <- function(x, y, cutoff=5){abs(x-y) <= cutoff}
which(outer(starts, ends, cmp), arr.ind=TRUE)
row col
[1,] 3 1
[2,] 4 1
[3,] 5 1
[4,] 6 1
[5,] 16 1
[6,] 17 1
[7,] 18 1
... etc.
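Because outer() forwards any extra arguments to the comparison function, the cutoff can be changed per call without redefining cmp. Here is a small sketch with shortened, made-up vectors (three values each) so the result can be checked by hand; the names m, hits and hits_tight are only illustrative.

```r
cmp <- function(x, y, cutoff = 5) abs(x - y) <= cutoff

starts <- c(18, 33, 54)
ends   <- c(22, 14, 30)

# Logical matrix: rows follow starts, columns follow ends
m <- outer(starts, ends, cmp)

# Pairs within +/- 5: (18, 22), (18, 14) and (33, 30)
hits <- which(m, arr.ind = TRUE)

# Extra arguments are passed on to cmp, so a tighter window is one argument away;
# with cutoff = 3 only the pair (33, 30) remains
hits_tight <- which(outer(starts, ends, cmp, cutoff = 3), arr.ind = TRUE)
```

The row column of the result indexes starts and the col column indexes ends, so starts[hits[, "row"]] and ends[hits[, "col"]] recover the matched values.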