aggregate (R) behaving differently for apparently identical tasks - r

I've been banging my head against a brick wall for days on this issue; I wonder if anyone can see what is wrong with my code, or tell me if I am overlooking something obvious.
I have this data.frame, where most columns are vectors, either numerical or character, and one column is a list of character vectors:
t0g2 <- structure(list(P = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,
4, 4, 5, 5, 5, 5), ID = c(8, 10, 7, 9, 5, 2, 3, 4, 8, 9, 1, 2,
8, 1, 4, 10, 4, 10, 2, 7), SC = c("A", "D", "A", "B", "B", "A",
"A", "E", "A", "B", "D", "A", "A", "D", "E", "D", "E", "D", "A",
"A"), FP = list(`40,41,37,8,11` = c("40", "41", "37", "8", "11"
), `49,28,16,41` = c("49", "28", "16", "41"), `15,49` = c("15",
"49"), `27,12,20,35,45` = c("27", "12", "20", "35", "45"), `1,34,43,37` = c("1",
"34", "43", "37"), `41,7,30,2,34,43` = c("41", "7", "30", "2",
"34", "43"), `22,35,31,10,3` = c("22", "35", "31", "10", "3"),
`29,6,15` = c("29", "6", "15"), `40,41,37,8,11` = c("40",
"41", "37", "8", "11"), `27,12,20,35,45` = c("27", "12",
"20", "35", "45"), `10,49,28` = c("10", "49", "28"), `41,7,30,2,34,43` = c("41",
"7", "30", "2", "34", "43"), `40,41,37,8,11` = c("40", "41",
"37", "8", "11"), `10,49,28` = c("10", "49", "28"), `29,6,15` = c("29",
"6", "15"), `49,28,16,41` = c("49", "28", "16", "41"), `29,6,15` = c("29",
"6", "15"), `49,28,16,41` = c("49", "28", "16", "41"), `41,7,30,2,34,43` = c("41",
"7", "30", "2", "34", "43"), `15,49` = c("15", "49"))), class = "data.frame", row.names = c("8",
"10", "7", "9", "5", "2", "3", "4", "81", "91", "1", "21", "82",
"11", "41", "101", "42", "102", "22", "71"))
I want to aggregate it by one of the columns, with the function for the other columns being simply the concatenation of unique values. [Yes, I know this can be done with many ad hoc packages, but I need to do it with base R].
This works perfectly well if I choose numeric column "ID" as the column to aggregate on:
aggregate(x=t0g2[, !(colnames(t0g2) %in% c("ID"))], by=list(ID=t0g2[["ID"]]),
FUN=function(y) unique(unlist(y)))
# ID P SC FP
#1 1 3, 4 D 10, 49, 28
#2 2 2, 3, 5 A 41, 7, 30, 2, 34, 43
#3 3 2 A 22, 35, 31, 10, 3
#4 4 2, 4, 5 E 29, 6, 15
#5 5 2 B 1, 34, 43, 37
#6 7 1, 5 A 15, 49
#7 8 1, 3, 4 A 40, 41, 37, 8, 11
#8 9 1, 3 B 27, 12, 20, 35, 45
#9 10 1, 4, 5 D 49, 28, 16, 41
or with character column "SC":
aggregate(x=t0g2[, !(colnames(t0g2) %in% c("SC"))], by=list(SC=t0g2[["SC"]]),
FUN=function(y) unique(unlist(y)))
# SC P ID FP
#1 A 1, 2, 3, 4, 5 8, 7, 2, 3 40, 41, 37, 8, 11, 15, 49, 7, 30, 2, 34, 43, 22, 35, 31, 10, 3
#2 B 1, 2, 3 9, 5 27, 12, 20, 35, 45, 1, 34, 43, 37
#3 D 1, 3, 4, 5 10, 1 49, 28, 16, 41, 10
#4 E 2, 4, 5 4 29, 6, 15
However, if I try with "P", which as far as I know is just another numerical column, this is what I get:
aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]),
FUN=function(y) unique(unlist(y)))
# P ID.1 ID.2 ID.3 ID.4 SC.1 SC.2 SC.3 FP
#1 1 8 10 7 9 A D B 40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
#2 2 5 2 3 4 B A E 1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
#3 3 8 9 1 2 A B D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
#4 4 8 1 4 10 A D E 40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
#5 5 4 10 2 7 E D A 29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43
Does anybody know what is going on, why this happens?
Literally going mental with this stuff...
EDIT: adding an example of the desired output from aggregating on "P", as requested by jay.sf.
# P ID SC FP
#1 1 8, 10, 7, 9 A, D, B 40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
#2 2 5, 2, 3, 4 B, A, E 1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
#3 3 8, 9, 1, 2 A, B, D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
#4 4 8, 1, 4, 10 A, D, E 40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
#5 5 4, 10, 2, 7 E, D, A 29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43
In fact, I found out that by setting simplify=F in aggregate, it works as I want.
I hope this won't backfire.
EDIT 2: it did backfire...
I don't want all my columns to become lists even when they can be vectors, but with simplify = F they do become lists:
sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("P"))],by=list(P=t0g2[["P"]]),FUN=function(y) unique(unlist(y)), simplify = F),class)
# P ID SC FP
#"numeric" "list" "list" "list"
sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = T),class)
# ID P SC FP
# "numeric" "list" "character" "list"
sapply(aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = F),class)
# ID P SC FP
#"numeric" "list" "list" "list"
So I still don't have a solution... :(
EDIT 3: maybe a viable (if rather clumsy) solution?
t0g2_by_ID <- aggregate(x=t0g2[,!(colnames(t0g2) %in% c("ID"))],by=list(ID=t0g2[["ID"]]),FUN=function(y) unique(unlist(y)), simplify = F)
sapply(t0g2_by_ID,class)
# ID P SC FP
#"numeric" "list" "list" "list"
for (i in 1:NCOL(t0g2_by_ID)) {y = t0g2_by_ID[,i]; if ((class(y) == "list") & (length(y) == length(unlist(y)))) {t0g2_by_ID[,i] <- unlist(y)} }
sapply(t0g2_by_ID,class)
# ID P SC FP
#"numeric" "list" "character" "list"
I tried to avoid the inelegant loop by using sapply, but then any cbind operation turns the result back into a data.frame of lists.
This is the best I can come up with.
If anyone can suggest how to do this better using only base R, that'd be great.

aggregate tries to return a matrix wherever this is possible. See this example:
# data
n <- 10
df <- data.frame(id= rep(1:2, each= n/2),
value= 1:n)
length(unique(df$value[df$id == 1])) == length(unique(df$value[df$id == 2]))
TRUE
Here the number of unique values is the same for every id, so aggregate returns a matrix:
aggregate(x= df[, "value"], by=list(id=df[, "id"]),
FUN=function(y) unique(unlist(y)))
id x.1 x.2 x.3 x.4 x.5
1 1 1 2 3 4 5
2 2 6 7 8 9 10
Now we change the data so that the number of unique values per id is no longer equal:
df$value[2] <- 1
length(unique(df$value[df$id == 1])) == length(unique(df$value[df$id == 2]))
FALSE
In this case we get an output with values separated by ,:
aggregate(x= df[, "value"], by=list(id=df[, "id"]),
FUN=function(y) unique(unlist(y)))
id x
1 1 1, 3, 4, 5
2 2 6, 7, 8, 9, 10
In your case, every P value has exactly 4 unique ID values and exactly 3 unique SC values, so aggregate returns those results as matrices. This is not true for FP: there aggregate cannot build a matrix, so we get the values separated by commas.
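A quick way to check this explanation is to count the unique values per group directly. The sketch below copies the P, ID and SC vectors from the question and uses tapply; the names id_counts, sc_counts and p_counts_by_id are only illustrative.

```r
# Vectors copied from the question's t0g2 data frame
P  <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)
ID <- c(8, 10, 7, 9, 5, 2, 3, 4, 8, 9, 1, 2, 8, 1, 4, 10, 4, 10, 2, 7)
SC <- c("A", "D", "A", "B", "B", "A", "A", "E", "A", "B", "D", "A",
        "A", "D", "E", "D", "E", "D", "A", "A")

# Helper: number of distinct values within one group
n_unique <- function(y) length(unique(y))

# Grouping by P: every group has the same count, so simplification to a matrix is possible
id_counts <- tapply(ID, P, n_unique)
sc_counts <- tapply(SC, P, n_unique)

# Grouping by ID: the counts differ between groups, so no matrix can be built
p_counts_by_id <- tapply(P, ID, n_unique)

print(id_counts)
print(sc_counts)
print(p_counts_by_id)
```

If the counts within a column are all equal, aggregate with the default simplify = TRUE can bind the per-group results into a matrix, which is what produces the ID.1, ID.2, ... columns.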

aggregate has an argument simplify that is TRUE by default, which means it tries to simplify the result to a vector or matrix when possible. Every group in P yields the same number of unique values (4 for ID, 3 for SC), so those columns are simplified to matrices. Just set simplify = FALSE to change this behavior:
aggregate(x=t0g2[, !(colnames(t0g2) %in% c("P"))], by=list(P=t0g2[["P"]]),
FUN=function(y) unique(unlist(y)), simplify = F)
#### OUTPUT ####
P ID SC FP
1 1 8, 10, 7, 9 A, D, B 40, 41, 37, 8, 11, 49, 28, 16, 15, 27, 12, 20, 35, 45
2 2 5, 2, 3, 4 B, A, E 1, 34, 43, 37, 41, 7, 30, 2, 22, 35, 31, 10, 3, 29, 6, 15
3 3 8, 9, 1, 2 A, B, D 40, 41, 37, 8, 11, 27, 12, 20, 35, 45, 10, 49, 28, 7, 30, 2, 34, 43
4 4 8, 1, 4, 10 A, D, E 40, 41, 37, 8, 11, 10, 49, 28, 29, 6, 15, 16
5 5 4, 10, 2, 7 E, D, A 29, 6, 15, 49, 28, 16, 41, 7, 30, 2, 34, 43
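With simplify = FALSE every aggregated column comes back as a list, even when each group yields a single value. A minimal base R sketch of one way to flatten only those columns follows; it uses a small made-up data frame d rather than t0g2, and the name is_flat is hypothetical.

```r
# Toy data (an assumption for illustration, not the asker's t0g2)
d <- data.frame(P  = c(1, 1, 2, 2),
                ID = c(8, 10, 5, 2),
                SC = c("A", "A", "B", "B"),
                stringsAsFactors = FALSE)

# Aggregate without simplification: ID and SC both become list columns
res <- aggregate(x = d[, c("ID", "SC")], by = list(P = d[["P"]]),
                 FUN = function(y) unique(unlist(y)), simplify = FALSE)

# A list column can become a plain vector only if every group yields exactly one value
is_flat <- vapply(res, function(col) is.list(col) && all(lengths(col) == 1L),
                  logical(1))
res[is_flat] <- lapply(res[is_flat], unlist, use.names = FALSE)

print(sapply(res, class))
```

Here SC has one unique value per group and is turned back into a character vector, while ID keeps two values per group and stays a list.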

Related

How can I recode a variable in R, so that the lowest value will be 1, the second lowest will be 2 etc

Imagine I have a tidy dataset with 1 variable and 10 observations. The values of the variable are e.g. 3, 5, 7, 9, 13, 17, 29, 33, 34, 67. How do I recode it so that the 3 will be 1, the 5 will be 2 (...) and the 67 will be 10?
One possibility is to use rank; in a `dplyr` setting it could look like this:
library(dplyr)
tibble(x = c(3, 5, 7, 9, 13, 17, 29, 33, 34, 67)) %>%
mutate(y = rank(x))
Here is one way -
x <- c(3, 5, 7, 9, 13, 17, 29, 33, 67, 34)
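# Look up each value's position among the sorted unique values,
# so that y stays aligned with the original order of x
y <- match(x, sort(unique(x)))
y
#[1] 1 2 3 4 5 6 7 8 10 9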
Changed the order of last 2 values so that it also works when the data is not in order.
Another way:
x <- c(3, 5, 7, 9, 13, 17, 29, 33, 67, 34)
x <- sort(x)
seq_along(x)
# 1 2 3 4 5 6 7 8 9 10
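The answers above do not say what should happen with tied values. As a hedged base R sketch, two common conventions are shown below on a made-up vector: consecutive codes without gaps (via factor) and ranks that leave gaps after ties (via rank with ties.method = "min").

```r
# Made-up example with a tie (5 appears twice) and unsorted input
x <- c(3, 5, 5, 67, 34)

# Dense recoding: equal values share a code, codes have no gaps
dense <- as.integer(factor(x))

# Minimum ranks: equal values share a rank, the next rank is skipped
min_rank <- rank(x, ties.method = "min")

print(dense)     # 1 2 2 4 3
print(min_rank)  # 1 2 2 5 4
```

Both results stay aligned with the original positions of x, so they can be stored as a new column next to the variable.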

Selecting Combinations Across Columns Row by Row With Overlap Threshold

I have a data frame that has rows that represent communities. For columns, the first column is the group that the community falls into (a total of 6 groups) and the remaining 8 are IDs of each member of the community.
What I would like to do is have a community (row) within groups 1, 3, and 5 to be picked where there is no overlap between them. Then, once I have that - I would like to pick a community from groups 2, 4, and 6 where there is no more than 25% overlap between the selected 6 total communities.
Here is an example dataset:
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
Based on the criteria I mentioned above, the following could be pulled out:
Group 1: 125, 8, 40, 127, 19, 33, 29, 3
Group 3: 11, 25, 126, 22, 56, 4, 6, 52
Group 5: 5, 63, 18, 48, 37, 32, 43, 1
Group 2: 25, 37, 8, 38, 40, 124, 32, 56
Group 4: 125, 15, 29, 4, 48, 5, 128, 11
Group 6: 34, 23, 33, 32, 63, 22, 19, 56
I believe this might be helpful (please let me know if not!).
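The first step would be to subset your data to Groups 1, 3, and 5. Then, using transpose from purrr and splitting by Group, cross gives all combinations that select one row from each group. Note that df as built in the question is a matrix, so it is converted to a data frame first, and that bind_rows comes from dplyr.
library(purrr)
library(dplyr) # for bind_rows
df <- as.data.frame(df) # cbind() returns a matrix, so df$Group would fail
grp_135 <- df[df$Group %in% c(1, 3, 5), ]
all_combn_135 <- lapply(cross(split(transpose(grp_135), grp_135$Group)), bind_rows)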
Checking the first element to see what we have:
R> all_combn_135[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 29 4 3 63 125 40 32 38
3 5 5 63 18 48 37 32 43 1
Next, we can check for overlap by counting duplicates. In this case, I just unlist the three rows, use table to get frequencies, and sum them up (subtracting 1 for each value found, since we only want duplicates).
combn_ovlp_135 <- lapply(all_combn_135, function(x) {
sum(table(unlist(x[-1])) - 1)
})
The ones without overlap can be obtained by:
no_ovlp <- all_combn_135[combn_ovlp_135 == 0]
no_ovlp
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 125 8 40 127 19 33 29 3
2 3 11 25 126 22 56 4 6 52
3 5 5 63 18 48 37 32 43 1
For the next part, do something similar (this can be broken out as a generalized function), except when checking for overlap, combine elements with the first no_ovlp from previously:
grp_246 <- df[df$Group %in% c(2, 4, 6), ]
all_combn_246 <- lapply(cross(split(transpose(grp_246), grp_246$Group)), bind_rows)
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) - 1) / ((ncol(df) - 1) * 6)
})
It is not entirely clear how you want to calculate overlap for this part and compare it with 25%. I counted duplicates and then divided by the number of columns (8, not counting Group) multiplied by 6 (rows), that is, by the 48 values in the six selected communities. To see which combinations of Groups 2, 4, and 6 could be combined with no_ovlp, you can try the following:
all_combn_246[combn_ovlp_246 < .25]
In my case, I believe none of the combinations met this criterion, although the first with 37.5% overlap was the minimum:
R> all_combn_246[[1]]
# A tibble: 3 x 9
Group Isol_1 Isol_2 Isol_3 Isol_4 Isol_5 Isol_6 Isol_7 Isol_8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 25 37 8 38 40 124 32 56
2 4 125 15 29 4 48 5 128 11
3 6 34 23 33 32 63 22 19 56
What was unclear is how to count duplicates. For example, how much overlap is c(1, 2, 3, 3, 3)?
This could be two duplicates (two extra 3's):
R> sum(table(x) - 1)
[1] 2
Or you could count number of values that have any duplicates (just the number 3 is duplicated):
R> sum(table(x) > 1)
[1] 1
If it is the latter, you could try:
combn_ovlp_246 <- lapply(all_combn_246, function(x) {
sum(table(c(unlist(x[-1]), unlist(no_ovlp[[1]][-1]))) > 1) / ((ncol(df) - 1) * 6)
})
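The two counting conventions discussed above can be compared side by side. This small sketch defines the example vector x explicitly (the snippets above use x without defining it); the variable names are illustrative only.

```r
# The example vector mentioned in the text
x <- c(1, 2, 3, 3, 3)

# Convention 1: count every extra copy of a value (two extra 3's)
extra_copies <- sum(table(x) - 1)

# Convention 2: count how many distinct values are duplicated at all (only the value 3)
duplicated_values <- sum(table(x) > 1)

# Convention 1 is equivalent to total length minus number of distinct values
same_as_first <- length(x) - length(unique(x))

print(c(extra_copies, duplicated_values, same_as_first))  # 2 1 2
```

Whichever convention is used, the denominator for the 25% threshold has to be chosen to match it.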
By shamelessly stealing Ben's use of cross(), I have this approach that I personally find easier to read:
# Returns the number of overlapping elements
overlap <- function(xx){
length(unlist(xx)) - length(unique(unlist(xx)))
}
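library(dplyr)
library(tidyr)
library(purrr)
df_135 <- df %>%
  as_tibble() %>%
  filter(Group %in% c(1,3,5)) %>%
  group_by(Group) %>%
  # Label each community as g<group>_<row within group>, e.g. g1_1
  mutate(Community = paste0("g", Group, "_", row_number())) %>%
  nest(Members = starts_with("Isol_")) %>%
  mutate(Members = map(Members, as.integer))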
df_135
# A tibble: 12 x 3
# Groups: Group [3]
# Group Community Members
# <dbl> <chr> <list>
# 1 1 g1_1 <int [8]>
# 2 1 g1_2 <int [8]>
# 3 1 g1_3 <int [8]>
# 4 1 g1_4 <int [8]>
# 5 3 g3_1 <int [8]>
# 6 3 g3_2 <int [8]>
# 7 3 g3_3 <int [8]>
# 8 3 g3_4 <int [8]>
# 9 5 g5_1 <int [8]>
#10 5 g5_2 <int [8]>
#11 5 g5_3 <int [8]>
#12 5 g5_4 <int [8]>
# Compute all combinations across groups
all_combns <- cross(split(df_135$Members, df_135$Group))
# select the combinations with the desired overlap
all_combns[map_int(all_combns, overlap) == 0]
# [[1]]
# [[1]]$`1`
# [1] 125 8 40 127 19 33 29 3
#
# [[1]]$`3`
# [1] 11 25 126 22 56 4 6 52
#
# [[1]]$`5`
# [1] 5 63 18 48 37 32 43 1
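The same enumerate-and-filter idea can be sketched in base R with expand.grid in place of cross(). The example below uses a small invented list of communities (two per group, three members each) rather than the question's data, and the overlap helper mirrors the one defined above.

```r
# Invented toy data: two candidate communities in each of three groups
communities <- list(
  g1 = list(c(1, 2, 3), c(4, 5, 6)),
  g3 = list(c(3, 7, 8), c(7, 8, 9)),
  g5 = list(c(10, 11, 12), c(1, 11, 12))
)

# Number of repeated members across a set of communities
overlap <- function(xx) {
  v <- unlist(xx)
  length(v) - length(unique(v))
}

# Every way of picking one community index per group
idx <- expand.grid(lapply(communities, seq_along))

# Overlap of each combination: pick element k of group g for each column of idx
n_ovlp <- apply(idx, 1, function(i) {
  overlap(Map(function(g, k) g[[k]], communities, i))
})

# Index combinations with no shared members
print(idx[n_ovlp == 0, ])
```

Each remaining row of idx gives the position of the chosen community within g1, g3 and g5.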
Here's a plain R solution. It's not the most efficient one, but it's very straightforward and therefore very tractable.
The code below collects all the values in group 1 (1,3,5) and group 2 (2,4,6), and samples n isolates from this list. It then tests for the minimal overlap and resamples group 2 if necessary. In the case of your request, it only needs to resample once or twice, but if your threshold is lower (e.g. 0.05), it may resample up to 50 times before it gets it right. In fact, if your threshold is too low and your number of samples too large (i.e. it is impossible to make this sample), it will warn you that it failed.
Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
Isol_1 = c(125, 25, 1, 126, 25, 128, 3, 128, 29, 15, 11, 18, 125, 6, 37, 4, 5, 19, 11, 4, 34, 32, 19, 1)
Isol_2 = c(8, 6, 56, 40, 37, 40, 125, 52, 4, 34, 25, 15, 15, 15, 23, 18, 63, 18, 22, 125, 23, 22, 11, 4)
Isol_3 = c(40, 34, 125, 63, 8, 25, 126, 48, 3, 125, 126, 37, 29, 126, 56, 29, 18, 40, 23, 25, 33, 43, 1, 11)
Isol_4 = c(127, 128, 8, 6, 38, 22, 25, 1, 63, 43, 22, 34, 4, 38, 22, 125, 48, 22, 126, 23, 32, 23, 23, 5)
Isol_5 = c(19, 4, 43, 125, 40, 37, 128, 125, 125, 23, 56, 43, 48, 48, 11, 33, 37, 63, 32, 63, 63, 48, 43, 52)
Isol_6 = c(33, 1, 128, 52, 124, 34, 15, 8, 40, 63, 4, 38, 5, 37, 8, 43, 32, 1, 19, 38, 22, 18, 56, 23)
Isol_7 = c(29, 63, 126, 128, 32, 63, 32, 11, 32, 33, 6, 6, 128, 19, 6, 15, 43, 33, 40, 11, 19, 56, 32, 18)
Isol_8 = c(3, 40, 34, 4, 56, 43, 52, 37, 38, 38, 52, 32, 11, 18, 33, 11, 1, 128, 37, 15, 56, 19, 5, 40)
df = cbind(Group, Isol_1, Isol_2, Isol_3, Isol_4, Isol_5, Isol_6, Isol_7, Isol_8)
df = as.data.frame(df)
subset1 <- df[df$Group %in% c(1,3,5),]
subset2 <- df[df$Group %in% c(2,4,6),]
values_in_subset1 <- subset1[2:ncol(subset1)] # Drop group column
values_in_subset1 <- as.vector(t(values_in_subset1)) # Convert to single vector
values_in_subset2 <- subset2[2:ncol(subset2)] # Drop group column
values_in_subset2 <- as.vector(t(values_in_subset2)) # Convert to single vector
n_sampled <- 8
sample1 <- sample(values_in_subset1, n_sampled, replace=F) #Replace=F is default, added here for readability
sample2 <- sample(values_in_subset2, n_sampled, replace=F) #Replace=F is default, added here for readability
percentage_overlap <- sum(sample1 %in% sample2)/n_sampled
min_percentage_overlap <- 0.25
retries <- 1
# Retry until it gets it right
while(percentage_overlap > min_percentage_overlap && retries < 1000)
{
retries <- retries + 1
sample2 <- sample(values_in_subset2, n_sampled, replace=F) #Replace=F is default, added here for readability
percentage_overlap <- sum(sample1 %in% sample2)/n_sampled
}
# Report on number of attempts
cat(paste("Sampled", retries, "times to make sure there was less than", min_percentage_overlap*100,"% overlap."))
# Finally, check if it worked.
if(percentage_overlap <= min_percentage_overlap){
cat("It's super effective! (not really though)")
} else {
cat("But it failed!")
}

Iterate over columns with NAs to create percentile variables with dplyr and data.table

I need quite a simple thing: to iterate over the columns of a dataset to create percentile versions of those columns. I tried with dplyr and data.table, but neither seems to do what I need. In particular, I need to exclude the NA values when creating the percentile versions of the columns.
Reproducible example below:
values<-c(19,
6,
27,
63,
50,
59,
97,
89,
NA,
9,
31,
58,
83,
2,
1,
31,
3,
1,
27,
40,
32,
42,
99,
NA,
12,
16,
23,
98,
44,
25,
13,
70,
64,
NA,
37,
75,
73,
59,
21,
3,
76,
43,
6,
96,
55,
48,
70,
90,
18,
58,
22,
19,
26,
49,
59,
94,
31,
45,
20,
8,
26,
56,
7,
11,
98,
50,
41,
38,
86,
0,
37,
NA,
40,
7,
88,
38,
41,
41,
19,
34,
21,
64,
87,
22,
54,
39,
75,
72,
91,
78)
values2<- c(98,
60,
9,
98,
NA,
88,
NA,
54,
92,
90,
NA,
83,
92,
65,
44,
NA,
98,
40,
26,
40,
54,
56,
15,
90,
15,
63,
57,
NA,
85,
69,
73,
43,
24,
27,
82,
75,
29,
98,
29,
5,
91,
88,
28,
12,
53,
NA,
2,
42,
86,
2,
78,
20,
50,
73,
77,
NA,
4,
39,
90,
NA,
29,
14,
98,
88,
77,
79,
30,
9,
74,
93,
NA,
16,
27,
16,
18,
40,
NA,
2,
66,
71,
82,
10,
62,
84,
25,
NA,
15,
12,
85,
50)
groups<-c(1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2,
2)
df<-as.data.frame(cbind(groups,values,values2))
library(dplyr)
for (i in c("values","values2")) {
df<-df %>%
group_by(groups) %>%
mutate(!!sym(paste( i,"_percentile", sep="")) := percent_rank(na.omit(i)))
}
for (i in c("values","values2")) {
df<-df %>%
group_by(groups) %>%
mutate(!!sym(paste( i,"_percentile", sep="")) := rank(i)/length(i) )
}
library(data.table)
df<- as.data.table(df)
for (i in c("values","values2")) {
df[, paste(i,"_percentile",sep="") := rank(get(i))/length( get(i)), by = groups ]
}
for (i in c("values","values2")) {
df[!is.na(i), paste(i,"_percentile",sep="") := rank(get(i))/length( get(i)), by = groups ]
}
One option is mutate_at. After grouping by 'groups', use mutate_at to loop over the columns whose names start with 'values', and replace the non-NA values with the percent_rank of the non-NA elements:
library(dplyr)
df %>%
group_by(groups) %>%
mutate_at(vars(starts_with('values')),
list(percentile = ~ replace(., !is.na(.), percent_rank(.[!is.na(.)]))))
Or with data.table
library(data.table)
nm1 <- paste0(names(df)[2:3], "_percentile")
setDT(df)[, (nm1) := lapply(.SD, function(x) replace(x, !is.na(x),
frank(x[!is.na(x)])/sum(!is.na(x)))), .SDcols = 2:3, by = groups]
My tidyverse answer has the same structure as @akrun's: it uses mutate_at to add multiple columns and starts_with to select the columns. A few things are worth pointing out with the more minimal example:
The percent_rank function already ignores NAs when it calculates, so you don't have to do the additional work of filtering them out of the calculation.
There is one degenerate case where a group has only one actual measurement (in my example, group "b"). percent_rank can return NaN there because it rescales min_rank. Inside the direct mutate_at call, that issue seems to be avoided. (It's unclear what value should be assigned there in your case.)
There's another sort-of degenerate case when there's a tie. In group "a", values2 has a tie for first place, and the tied percent_rank values are accordingly not 1.0.
library(tidyverse)
df <- tribble(
~groups, ~values1, ~values2,
"a", 1, 10,
"a", 2, 10,
"a", NA, 8,
"a", 3, 9,
"a", 4, 7,
"b", NA, 10,
"b", 2, NA,
"b", NA, 8
)
df %>%
group_by(groups) %>%
mutate_at(
vars(starts_with("values")),
list(percentile = ~ percent_rank(.)))
#> # A tibble: 8 x 5
#> # Groups: groups [2]
#> groups values1 values2 values1_percentile values2_percentile
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 10 0 0.75
#> 2 a 2 10 0.333 0.75
#> 3 a NA 8 NA 0.25
#> 4 a 3 9 0.667 0.5
#> 5 a 4 7 1 0
#> 6 b NA 10 NA 1
#> 7 b 2 NA 0 NA
#> 8 b NA 8 NA 0
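For comparison, the rank(x)/length(x) variant from the question can be written in base R with ave, skipping NAs inside each group. This is only a sketch on a shortened, made-up data frame; rank_frac is a hypothetical helper name.

```r
# Shortened toy data with NAs in each group
df <- data.frame(groups = c(1, 1, 1, 1, 2, 2, 2),
                 values = c(19, 6, NA, 27, 63, NA, 50))

# Rank of each non-NA value divided by the number of non-NA values; NAs stay NA
rank_frac <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  out[ok] <- rank(x[ok]) / sum(ok)
  out
}

# ave() applies the function within each group and keeps the original row order
df$values_percentile <- ave(df$values, df$groups, FUN = rank_frac)
print(df)
```

To cover several columns, the same ave call can be repeated in a loop over the column names.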

R - Keep reading line if 7 or more numbers are >= 10

I have a file foo.txt that looks like this:
7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5, 13, 34, 24, 17, 50, 30, 42, 35, 29, 27, 52, 35, 44, 52, 36, 39, 25, 40, 50, 52, 40, 2, 52, 52, 31, 35, 30, 19, 32, 46, 50, 43, 36, 15, 21, 16, 36, 25, 7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6
I want to read the numbers in sets of 15, moving to the right one number at the time:
7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5
then
3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5, 22
and so on.
If 7 or more of those 15 numbers are >= 10, keep them in a growing object that ends when the condition is no longer met. So the first set to keep would be
3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5, 13
because 7 of those 15 numbers are >= 10 (those numbers are 22, 18, 14, 23, 16, 18 and 13).
The output file would look like this:
3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5, 13, 34, 24, 17, 50, 30, 42, 35, 29, 27, 52, 35, 44, 52, 36, 39, 25, 40, 50, 52, 40, 2, 52, 52, 31, 35, 30, 19, 32, 46, 50, 43, 36, 15, 21, 16, 36, 25, 7, 3, 5, 7, 3, 3, 3, 3
So far I can get the sets of 15 numbers, but I don't know how to express the condition "7 or more must be >= 10":
qual <- readLines("foo.txt", 1)
separados <- unlist(strsplit(qual, ", "))
for (i in 1:length(qual)) {
separados[(i):(i + 14)] -> numbers
I don't mind the language as long as it does the work
I've added two ='s to Vlo's solutions and made this for you. Does this answer your question?
foo.txt <- c(7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6, 7, 5, 5, 22, 18, 14, 23, 16, 18, 5,
13, 34, 24, 17, 50, 30, 42, 35, 29, 27, 52, 35, 44, 52, 36, 39, 25, 40,
50, 52, 40, 2, 52, 52, 31, 35, 30, 19, 32, 46, 50, 43, 36, 15, 21, 16,
36, 25, 7, 3, 5, 7, 3, 3, 3, 3, 3, 3, 3, 6)
# install.packages(c("zoo"), dependencies = TRUE)
require(zoo)
bar <- rollapply(foo.txt, 15, function(x) sum(x >= 10 ) >= 7)
(product <- foo.txt[bar])
[1] 3 3 3 6 7 5 5 22 18 14 23 16 18 5 13 34 24 17 50 30 42 35 29 27
[25] 52 35 44 52 36 39 25 40 50 52 40 2 52 52 31 35 30 19 32 46 50 43 3 3
[49] 3 3 3 6
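A base R sketch of the windowing idea, without zoo, is given below. It reads the "growing object" as the union of all positions covered by at least one qualifying window, which is an assumption about the intended behaviour. A short invented vector with width 3 and at least 2 hits keeps the example easy to follow; for the question, width would be 15 and min_hits 7.

```r
# Invented example data and parameters
x <- c(1, 12, 15, 2, 11, 3, 1, 1)
width <- 3
min_hits <- 2

# Start index of every full window
starts <- seq_len(length(x) - width + 1)

# Does the window starting at i contain enough values >= 10?
keep_window <- vapply(starts, function(i) {
  sum(x[i:(i + width - 1)] >= 10) >= min_hits
}, logical(1))

# Mark every position that lies inside at least one qualifying window
covered <- logical(length(x))
for (i in starts[keep_window]) {
  covered[i:(i + width - 1)] <- TRUE
}

print(x[covered])  # 1 12 15 2 11
```

Unlike indexing x directly with the per-window logical vector, this keeps all elements of each qualifying window.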
I would do it in Python (you said you don't mind the language):
array = []
with open("foo.txt", "r") as f:
    for line in f:
        for num in line.strip().split(', '):
            array.append(int(num))
result = []
growing = False
while len(array) >= 15:
    if sum(1 for e in filter(lambda x: x >= 10, array[:15])) >= 7:
        if growing:
            # The window moved one step to the right, so only its last number is new
            result.append(array[14])
        else:
            result.extend(array[:15])
            growing = True
    else:
        growing = False
    del array[0]
print(str(result)[1:-1])
Short explanation: the first loop simply reads the lines in the file, strips the end of line, splits on ", " and appends each number to array.
The while loop checks the first 15 numbers in array; if at least 7 of them are >= 10, it appends to result either all 15 numbers (when a new run starts) or just the last one (when the previous window already met the condition). At the end of each iteration, it removes the first number in array so that the loop can continue with the next 15 numbers.

Adding a row with Sum and mean of the columns

I have a data frame like the one below.
`> am_me
Group.1 Group.2 x.x x.y
2 AM clearterminate 3 21.00000
3 AM display.cryptic 86 30.12791
4 AM price 71 898.00000`
I would like to get a result like the one below.
`> am_me_t
Group.2 x.x x.y
2 clearterminate 3 21
3 display.cryptic 86 30.1279069767442
4 price 71 898
41 AM 160 316.375968992248`
I have taken out the first column and got the result below.
`> am_res
Group.2 x.x x.y
2 clearterminate 3 21.00000
3 display.cryptic 86 30.12791
4 price 71 898.00000`
When I try to use rbind to add "AM" as a new row, as below, I get a warning message and an NA.
`> am_me_t <- rbind(am_res, c("AM", colSums(am_res[2]), colMeans(am_res[3])))
Warning message:
invalid factor level, NAs generated in: "[<-.factor"(`*tmp*`, ri, value = "AM")
Group.2 x.x x.y
2 clearterminate 3 21
3 display.cryptic 86 30.1279069767442
4 price 71 898
41 <NA> 160 316.375968992248`
For your information, Output of edit(am_me)
`> edit(am_me)
structure(list(Group.1 = structure(as.integer(c(2, 2, 2)), .Label = c("1Y",
"AM", "BE", "CM", "CO", "LX", "SN", "US", "VK", "VS"), class = "factor"),
Group.2 = structure(as.integer(c(2, 5, 9)), .Label = c("bestbuy",
"clearterminate", "currency.display", "display", "display.cryptic",
"fqa", "mileage.display", "ping", "price", "reissue", "reissuedisplay",
"shortaccess.followon"), class = "factor"), x.x = as.integer(c(3,
86, 71)), x.y = c(21, 30.1279069767442, 898)), .Names = c("Group.1",
"Group.2", "x.x", "x.y"), row.names = c("2", "3", "4"), class = "data.frame")`
Also
`> edit(me)
structure(list(Group.1 = structure(as.integer(c(1, 2, 2, 2, 3,
4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8,
8, 8, 9, 9, 10, 10, 10, 10, 10, 10)), .Label = c("1Y", "AM",
"BE", "CM", "CO", "LX", "SN", "US", "VK", "VS"), class = "factor"),
Group.2 = structure(as.integer(c(8, 2, 5, 9, 10, 1, 2, 5,
9, 1, 2, 5, 9, 1, 2, 3, 4, 7, 9, 11, 12, 2, 4, 6, 1, 2, 5,
9, 2, 5, 1, 2, 3, 5, 9, 10)), .Label = c("bestbuy", "clearterminate",
"currency.display", "display", "display.cryptic", "fqa",
"mileage.display", "ping", "price", "reissue", "reissuedisplay",
"shortaccess.followon"), class = "factor"), x.x = as.integer(c(1,
3, 86, 71, 1, 2, 5, 1, 52, 10, 7, 27, 15, 5, 267, 14, 4,
1, 256, 1, 1, 80, 1, 78, 2, 10, 23, 6, 1, 2, 4, 3, 3, 11,
1, 1)), x.y = c(5, 21, 30.1279069767442, 898, 12280, 800,
56.4, 104, 490.442307692308, 1759.1, 18.1428571428571, 1244.81481481481,
518.533333333333, 3033.2, 18.5468164794007, 20, 3788.5, 23,
2053.49609375, 3863, 6376, 17.825, 240, 1752.21794871795,
1114.5, 34, 1369.60869565217, 1062.16666666667, 23, 245,
5681.5, 11.3333333333333, 13.3333333333333, 1273.81818181818,
2076, 5724)), .Names = c("Group.1", "Group.2", "x.x", "x.y"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
"21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31",
"32", "33", "34", "35", "36"), class = "data.frame")
Group.1 Group.2 x.x x.y
1 1Y ping 1 5.00000
2 AM clearterminate 3 21.00000
3 AM display.cryptic 86 30.12791
4 AM price 71 898.00000
5 BE reissue 1 12280.00000
6 CM bestbuy 2 800.00000
7 CM clearterminate 5 56.40000
8 CM display.cryptic 1 104.00000
9 CM price 52 490.44231
10 CO bestbuy 10 1759.10000
11 CO clearterminate 7 18.14286
12 CO display.cryptic 27 1244.81481
13 CO price 15 518.53333
14 LX bestbuy 5 3033.20000
15 LX clearterminate 267 18.54682
16 LX currency.display 14 20.00000
17 LX display 4 3788.50000
18 LX mileage.display 1 23.00000
19 LX price 256 2053.49609
20 LX reissuedisplay 1 3863.00000
21 LX shortaccess.followon 1 6376.00000
22 SN clearterminate 80 17.82500
23 SN display 1 240.00000
24 SN fqa 78 1752.21795
25 US bestbuy 2 1114.50000
26 US clearterminate 10 34.00000
27 US display.cryptic 23 1369.60870
28 US price 6 1062.16667
29 VK clearterminate 1 23.00000
30 VK display.cryptic 2 245.00000
31 VS bestbuy 4 5681.50000
32 VS clearterminate 3 11.33333
33 VS currency.display 3 13.33333
34 VS display.cryptic 11 1273.81818
35 VS price 1 2076.00000
36 VS reissue 1 5724.00000`
The Group.2 column is a factor, which limits its possible values to the existing levels. You can convert it to character with am_me$Group.2 <- as.character(am_me$Group.2); after that, the AM value can be added without the warning.
Note that you can also use sum() and mean() for single column operations.
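Putting these two suggestions together, a minimal sketch could look like the following. It rebuilds am_res from the values shown in the question and, as a design choice not taken from the answer, builds the summary row as a one-row data frame so that the numeric columns stay numeric (c("AM", ...) would coerce everything to character).

```r
# am_res as shown in the question, with Group.2 stored as a factor
am_res <- data.frame(
  Group.2 = factor(c("clearterminate", "display.cryptic", "price")),
  x.x = c(3L, 86L, 71L),
  x.y = c(21, 30.1279069767442, 898)
)

# Convert the factor to character so that a new label can be added
am_res$Group.2 <- as.character(am_res$Group.2)

# Summary row: sum of x.x and mean of x.y, labelled "AM"
total <- data.frame(Group.2 = "AM",
                    x.x = sum(am_res$x.x),
                    x.y = mean(am_res$x.y),
                    stringsAsFactors = FALSE)

am_me_t <- rbind(am_res, total)
print(am_me_t)
```

The last row then holds 160 for x.x and roughly 316.376 for x.y, matching the desired output in the question.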
