Group by partial string matches - r

I have a table of categories, each with a count value, that I'd like to collapse based on similarity ... for example, Mariner-1_Amel and Mariner-10 would become a single category, Mariner, and anything with 'Jockey' or 'hAT' in the name should be collapsed together.
I'm struggling to find a solution that can cope with all the possibilities. Is there an easy dplyr solution?
Reproducible with:
> dput(tibs)
structure(list(type = c("(TTAAG)n_1", "AMARI_1", "Copia-4_LH-I",
"DNA", "DNA-1_CQ", "DNA/hAT-Charlie", "DNA/hAT-Tip100", "DNA/MULE-MuDR",
"DNA/P", "DNA/PiggyBac", "DNA/TcMar-Mariner", "DNA/TcMar-Tc1",
"DNA/TcMar-Tigger", "G3_DM", "Gypsy-10_CFl-I", "hAT-1_DAn", "hAT-16_SM",
"hAT-N4_RPr", "HELITRON7_CB", "Jockey-1_DAn", "Jockey-1_DEl",
"Jockey-12_DF", "Jockey-5_DTa", "Jockey-6_DYa", "Jockey-6_Hmel",
"Jockey-7_HMM", "Jockey-8_Hmel", "LINE/Dong-R4", "LINE/I", "LINE/I-Jockey",
"LINE/I-Nimb", "LINE/Jockey", "LINE/L1", "LINE/L2", "LINE/R1",
"LINE/R2", "LINE/R2-NeSL", "LINE/Tad1", "LTR/Gypsy", "Mariner_CA",
"Mariner-1_AMel", "Mariner-10_HSal", "Mariner-13_ACe", "Mariner-15_HSal",
"Mariner-16_DAn", "Mariner-19_RPr", "Mariner-30_SM", "Mariner-39_SM",
"Mariner-42_HSal", "Mariner-46_HSal", "Mariner-49_HSal", "TE-5_EL",
"Unknown", "Utopia-1_Crp"), n = c(1L, 1L, 1L, 2L, 1L, 18L, 3L,
9L, 2L, 8L, 21L, 12L, 18L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 1L, 2L, 7L, 2L, 7L, 24L, 1L, 1L, 5L, 3L, 1L,
1L, 7L, 1L, 5L, 1L, 1L, 5L, 5L, 1L, 1L, 3L, 5L, 5L, 2L, 1L, 190L,
1L)), row.names = c(NA, -54L), class = c("tbl_df", "tbl", "data.frame"
))

It seems to me that your broader types mostly (or entirely) appear at the beginning of the string. You could therefore use just the first alphanumeric sequence ([[:alnum:]]+) of type as the broader type. This would give you the following types:
library(tidyverse)
df %>%
mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
count(type_short, sort = TRUE)
#> # A tibble: 15 x 2
#> type_short n
#> <chr> <int>
#> 1 Mariner 12
#> 2 LINE 11
#> 3 DNA 10
#> 4 Jockey 8
#> 5 hAT 3
#> 6 AMARI 1
#> 7 Copia 1
#> 8 G3 1
#> 9 Gypsy 1
#> 10 HELITRON7 1
#> 11 LTR 1
#> 12 TE 1
#> 13 TTAAG 1
#> 14 Unknown 1
#> 15 Utopia 1
You can easily use the new column to group_by:
df %>%
mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
group_by(type_short) %>%
summarise(n = sum(n))
#> # A tibble: 15 x 2
#> type_short n
#> <chr> <int>
#> 1 AMARI 1
#> 2 Copia 1
#> 3 DNA 94
#> 4 G3 1
#> 5 Gypsy 3
#> 6 hAT 5
#> 7 HELITRON7 1
#> 8 Jockey 10
#> 9 LINE 54
#> 10 LTR 7
#> 11 Mariner 35
#> 12 TE 1
#> 13 TTAAG 1
#> 14 Unknown 190
#> 15 Utopia 1
Theoretically, you could also try to use string similarity here. Yet your types do not have great similarity among themselves. A relative Levenshtein distance (distance / characters of the longer string) for example retrieves results like this:
strings <- c("Mariner-1_Amel", "Mariner-10")
adist(strings) / max(nchar(strings))
#> [,1] [,2]
#> [1,] 0.0000000 0.3571429
#> [2,] 0.3571429 0.0000000
This could be interpreted as the two types differing by about 36% (i.e. being about 64% similar). Finding a good threshold might be hard in that case.
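For readers avoiding the tidyverse, the same prefix-extraction idea can be sketched in base R with regmatches() and aggregate(). The small data frame below uses a few of the question's type strings with illustrative counts:

```r
# Base-R sketch of the same idea (no packages): extract the first
# alphanumeric run as the broad type, then sum n within it.
tibs <- data.frame(
  type = c("Mariner-1_AMel", "Mariner-10_HSal", "DNA/hAT-Charlie", "LINE/Jockey"),
  n    = c(5L, 1L, 18L, 24L)
)
# regmatches()/regexpr() pulls the first [[:alnum:]]+ match from each string
tibs$type_short <- regmatches(tibs$type, regexpr("[[:alnum:]]+", tibs$type))
collapsed <- aggregate(n ~ type_short, data = tibs, FUN = sum)
collapsed
#>   type_short  n
#> 1        DNA 18
#> 2       LINE 24
#> 3    Mariner  6
```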

This solution uses dplyr's case_when together with base R's grepl.
library(dplyr)
tibs %>%
  mutate(category = case_when(
    grepl("hAT|Jockey", type) ~ "Jockey",
    grepl("Mariner", type) ~ "Mariner",
    grepl("DNA", type) ~ "DNA",
    grepl("LINE", type) ~ "LINE",
    TRUE ~ as.character(type)
  ),
  category = factor(category)
  )

If there is no commonality to define the groups, you can define individual conditions using case_when.
library(dplyr)
library(stringr)
tibs %>%
mutate(category = case_when(str_detect(type, 'Mariner-\\d+') ~ 'Mariner',
str_detect(type, 'Jockey|hAT') ~ 'common',
#Add more conditions
))
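A more scalable variant of the same idea (a sketch, not from the answers above): keep the patterns in a named vector and let the first match win, mirroring the top-to-bottom order of case_when. The `patterns` vector and `categorise` helper are illustrative names.

```r
# Patterns checked in order; the first match decides the category,
# and unmatched types fall back to their raw value.
patterns <- c(Jockey = "Jockey|hAT", Mariner = "Mariner",
              DNA = "DNA", LINE = "LINE")
categorise <- function(x) {
  vapply(x, function(s) {
    hit <- which(vapply(patterns, grepl, logical(1), x = s))
    if (length(hit)) names(patterns)[hit[1]] else s
  }, character(1), USE.NAMES = FALSE)
}
categorise(c("DNA/hAT-Tip100", "Mariner-10_HSal", "LINE/L1", "Unknown"))
#> [1] "Jockey"  "Mariner" "LINE"    "Unknown"
```

Adding a new group then only means adding one entry to `patterns`, rather than another case_when branch.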

Related

Looping through a column in R as variable changes

I am a novice trying to analyze trap catch data in R and am looking for an efficient way to loop through by trap line. The first column is trap ID. The second column is the trap line that each trap is associated with. The remaining columns are values related to target catch and bycatch for each visit to the traps. I want to write code that will evaluate the data during each visit for each trap line. Here is an example of data I am working with:
Sample Data:
Data <- structure(list(Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L), Trapline = c("Cemetery",
"Cemetery", "Golf", "Church", "Church", "Church"), Target_Visit_1 = c(0L,
1L, 5L, 0L, 1L, 1L), Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L,
4L), Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L), Bycatch_Visit_2 = c(4L,
2L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
The number of traps per trapline varies. I have a code that I wrote out for each Trapline (there are 14 different traplines), but I was hoping there would be a way to consolidate it into one line of code that would calculate values while the trapline was constant, and then when it changed to the next trapline it would start a new calculation. Here is an example of how I was finding the sum of bycatch found at the Cemetery Trapline for visit 1.
CemetaryBycatch1 <- Data %>% filter(Trapline == "Cemetery") %>% select(Bycatch_Visit_1)
sum(CemetaryBycatch1)
As of right now I have code like this written out for each trapline for each visit, but with 14 traplines and 8 total visits, I would like to avoid having to write out so many lines of code and was hoping there was a way to loop through it with one block of code that would calculate value (sum, mean, etc.) for each trap line.
Thanks
Does something like this help you?
You can add a filter for Trapline in between group_by and summarise_all.
Code:
library(dplyr)
Data <- structure(list(Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L), Trapline = c("Cemetery",
"Cemetery", "Golf", "Church", "Church", "Church"), Target_Visit_1 = c(0L,
1L, 5L, 0L, 1L, 1L), Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L,
4L), Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L), Bycatch_Visit_2 = c(4L,
2L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Data %>%
group_by(Trap_ID, Trapline) %>%
summarise_all(list(sum))
Output:
#> # A tibble: 6 x 6
#> # Groups: Trap_ID [3]
#> Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> <int> <chr> <int> <int> <int> <int>
#> 1 1 Cemetery 0 3 1 4
#> 2 1 Church 0 2 0 0
#> 3 1 Golf 5 0 2 1
#> 4 2 Cemetery 1 2 1 2
#> 5 2 Church 1 1 1 1
#> 6 3 Church 1 4 0 0
Created on 2020-10-16 by the reprex package (v0.3.0)
Adding another row to Data:
Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
1 Cemetery 100 200 1 4
Will give you:
#> # A tibble: 6 x 6
#> # Groups: Trap_ID [3]
#> Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> <int> <chr> <int> <int> <int> <int>
#> 1 1 Cemetery 100 203 2 8
#> 2 1 Church 0 2 0 0
#> 3 1 Golf 5 0 2 1
#> 4 2 Cemetery 1 2 1 2
#> 5 2 Church 1 1 1 1
#> 6 3 Church 1 4 0 0
Created on 2020-10-16 by the reprex package (v0.3.0)
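If the goal really is one total per trapline per visit (collapsing over Trap_ID, unlike the grouped output above), base R's aggregate with a formula interface does it in one call. A sketch on the question's data:

```r
Data <- data.frame(
  Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L),
  Trapline = c("Cemetery", "Cemetery", "Golf", "Church", "Church", "Church"),
  Target_Visit_1 = c(0L, 1L, 5L, 0L, 1L, 1L),
  Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L, 4L),
  Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L),
  Bycatch_Visit_2 = c(4L, 2L, 1L, 0L, 1L, 0L)
)
# Sum every visit column within each trapline in one call
per_line <- aggregate(cbind(Target_Visit_1, Bycatch_Visit_1,
                            Target_Visit_2, Bycatch_Visit_2) ~ Trapline,
                      data = Data, FUN = sum)
per_line
#>   Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> 1 Cemetery              1               5              2               6
#> 2   Church              2               7              1               1
#> 3     Golf              5               0              2               1
```

Swapping `sum` for `mean` gives the per-trapline averages with the same one-liner.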

R aggregate column until one condition is met

I have a dataframe of this form:
ID Var1 Var2
1 1 1
1 2 2
1 3 3
1 4 2
1 5 2
2 1 4
2 2 8
2 3 10
2 4 10
2 5 7
and I would like to filter the Var1 values by group for their maximum, on the condition that the maximum value of Var2 has not yet been reached. This will be part of a new dataframe containing only one row per ID, so the outcome should look something like this:
ID Var1
1 2
2 2
So the function should filter the dataframe for the maximum, but only consider the rows before Var2 reaches its maximum. The row containing the maximum itself should not be included, and neither should the rows after it.
I tried building something with a while loop, but it didn't work out. I'd also be thankful if the solution doesn't employ data.table.
Thanks in advance
Maybe you could do something like this:
DF <- structure(list(
ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
Var2 = c(1L, 2L, 3L, 2L, 2L, 4L, 8L, 10L, 10L, 7L)),
class = "data.frame", row.names = c(NA, -10L))
library(dplyr)
DF %>% group_by(ID) %>%
slice(1:(which.max(Var2)-1)) %>%
slice_max(Var1) %>%
select(ID, Var1)
#> # A tibble: 2 x 2
#> # Groups: ID [2]
#> ID Var1
#> <int> <int>
#> 1 1 2
#> 2 2 2
Created on 2020-08-04 by the reprex package (v0.3.0)
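The same logic can be sketched in base R with split() and lapply(), in case dplyr is not available (it assumes, as in the question's data, that the maximum of Var2 is not in a group's first row):

```r
DF <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Var1 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
  Var2 = c(1L, 2L, 3L, 2L, 2L, 4L, 8L, 10L, 10L, 7L)
)
# Within each ID, keep only the rows before the first maximum of Var2,
# then take the largest Var1 that remains.
res <- do.call(rbind, lapply(split(DF, DF$ID), function(d) {
  before <- d[seq_len(which.max(d$Var2) - 1), , drop = FALSE]
  data.frame(ID = d$ID[1], Var1 = max(before$Var1))
}))
res
#>   ID Var1
#> 1  1    2
#> 2  2    2
```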

Complex If Else Statement in For Loop in R Warning Message

I made a for loop with many if else statements for my dataset and 2 empty vectors. However, I am getting a warning message saying:
In closenessSupport[i] <- rowMeans(seniorEdPlans[c("closenessFriends", ... :
number of items to replace is not a multiple of replacement length.
I was just wondering how to fix this vector-length problem, because I think it is interfering with my intention to find the mean of two columns. Any help appreciated.
Wow, way too many i-th's for me. But here are a few nudges towards an answer. You definitely don't want a for loop down all the rows of your dataframe in this case; R is optimized to work on columns. I'm not totally sure I understand all your conditionals, but most likely dplyr::case_when will serve you well.
I grabbed your data and dputted just the first 20 rows. Then I wrote a mutate and case_when that produces a start towards closenessSupport. Is this sort of what you're out to do?
Revised after your additional input just the columns of interest
# https://stackoverflow.com/questions/61582653
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
seniored <- structure(list(id = 1:20,
age = c(17L, 16L, 17L, 16L, 17L, 18L,
17L, 17L, 18L, 16L, 17L, 17L, 17L, 17L, 17L, 17L, 16L, 17L, 16L,
18L),
higherEd = structure(c(1L, 5L, 1L, 1L, 3L, 1L, 2L, 2L,
5L, 5L, 3L, 4L, 3L, 2L, 5L, 3L, 4L, 5L, 1L, 1L), .Label = c("2-year",
"4-year", "None", "Other", "Vocational"), class = "factor"),
riskGroup = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 3L,
1L, 3L, 3L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 3L), .Label = c("High",
"Low", "Medium"), class = "factor"),
GPA = c(3.169, 2.703,
3.225, 2.488, 2.618, 2.928, 3.176, 3.256, 3.48, 3.244, 3.265,
3.4, 3.109, 3.513, 3.102, 2.656, 2.853, 3.046, 2.304, 3.473
),
closenessFriends = c(7L, 7L, 7L, 8L, NA, NA, NA, 6L, 7L,
NA, 5L, 6L, 3L, 1L, 1L, NA, 8L, 2L, NA, 8L),
closenessMentors = c(6L,
NA, 5L, NA, 5L, 4L, 8L, 6L, 4L, 5L, 4L, 4L, 4L, 5L, 5L, 5L,
3L, 4L, NA, 5L),
numSupportSources = c(2L, 1L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 0L, 2L
)), row.names = c(NA, 20L), class = "data.frame")
seniored %>%
mutate(
closenessSupport = case_when(
numSupportSources == 1 & !is.na(closenessFriends) ~ as.numeric(closenessFriends),
numSupportSources == 1 & !is.na(closenessMentors) ~ as.numeric(closenessMentors),
numSupportSources == 2 ~ (closenessFriends + closenessMentors)/2,
numSupportSources == 0 ~ NA_real_),
supportType = case_when(
numSupportSources == 1 & !is.na(closenessFriends) ~ "FriendOnly",
numSupportSources == 1 & !is.na(closenessMentors) ~ "MentorOnly",
numSupportSources == 2 ~ "Both",
numSupportSources == 0 ~ "Neither"
)
) %>%
select(numSupportSources, closenessFriends, closenessMentors, closenessSupport, supportType)
#> numSupportSources closenessFriends closenessMentors closenessSupport
#> 1 2 7 6 6.5
#> 2 1 7 NA 7.0
#> 3 2 7 5 6.0
#> 4 1 8 NA 8.0
#> 5 1 NA 5 5.0
#> 6 1 NA 4 4.0
#> 7 1 NA 8 8.0
#> 8 2 6 6 6.0
#> 9 2 7 4 5.5
#> 10 1 NA 5 5.0
#> 11 2 5 4 4.5
#> 12 2 6 4 5.0
#> 13 2 3 4 3.5
#> 14 2 1 5 3.0
#> 15 2 1 5 3.0
#> 16 1 NA 5 5.0
#> 17 2 8 3 5.5
#> 18 2 2 4 3.0
#> 19 0 NA NA NA
#> 20 2 8 5 6.5
#> supportType
#> 1 Both
#> 2 FriendOnly
#> 3 Both
#> 4 FriendOnly
#> 5 MentorOnly
#> 6 MentorOnly
#> 7 MentorOnly
#> 8 Both
#> 9 Both
#> 10 MentorOnly
#> 11 Both
#> 12 Both
#> 13 Both
#> 14 Both
#> 15 Both
#> 16 MentorOnly
#> 17 Both
#> 18 Both
#> 19 Neither
#> 20 Both
Created on 2020-05-04 by the reprex package (v0.3.0)
You had asked yesterday what was wrong with your loop. I looked today: the issue was running the row-based rowMeans() inside the loop. It already operates on rows, so running it inside a for loop that iterates through your rows is bound to cause problems.
I also made an example data set with representative values for your data. It may not matter for your current data, but a for loop will be much slower: in the 20,000-row case the for loop took 1.4 seconds, the dplyr solution 11 milliseconds.
# build a reproducible dataset assume valid scores 1 - 8
# we'll make 9's equal to NA
set.seed(2020)
a <- sample(1:9, 20000, replace = TRUE)
a[a == 9] <- NA
set.seed(2021)
b <- sample(1:9, 20000, replace = TRUE)
b[b == 9] <- NA
seniorEdPlans2 <- data.frame(closenessFriends = a,
closenessMentors = b)
# use apply to calculate numSupportSources
seniorEdPlans2$numSupportSources <- apply(seniorEdPlans2,
1,
function(x) sum(!is.na(x))
)
# head(seniorEdPlans2, 50) # close enough
# this was the source of your error message it's already
# row based so can't put it in a for loop
seniorEdPlans2$closenessSupport <- rowMeans(seniorEdPlans2[c('closenessFriends', 'closenessMentors')],
na.rm = TRUE)
# your for loop
for (i in 1:nrow(seniorEdPlans2)) {
if (seniorEdPlans2$numSupportSources[i] == 2) {
seniorEdPlans2$supportType[i] <- "Both"
} else if (seniorEdPlans2$numSupportSources[i] == 0) {
seniorEdPlans2$supportType[i] <- "Neither"
} else if (!is.na(seniorEdPlans2$closenessFriends[i])) {
seniorEdPlans2$supportType[i] <- "FriendOnly"
} else {
seniorEdPlans2$supportType[i] <- "MentorOnly"
}
}
# head(seniorEdPlans2, 50)
Created on 2020-05-05 by the reprex package (v0.3.0)
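As a further speed-up, the loop's conditionals vectorise directly: nested ifelse() calls evaluate every row at once instead of one row per iteration. A toy sketch (illustrative values, not the question's data):

```r
f <- c(7L, NA, NA, 8L)   # closenessFriends (toy values)
m <- c(6L, 5L, NA, NA)   # closenessMentors (toy values)
n <- (!is.na(f)) + (!is.na(m))  # numSupportSources per row
# Same branch order as the loop: both sources, neither, friend-only, mentor-only
supportType <- ifelse(n == 2, "Both",
               ifelse(n == 0, "Neither",
               ifelse(!is.na(f), "FriendOnly", "MentorOnly")))
supportType
#> [1] "Both"       "MentorOnly" "Neither"    "FriendOnly"
```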

Checking that two different variables match

I have a dataset for which I need to run a check on two variables, "Call" and "cluster".
I need to check that calls of the same value all have the same cluster number.
The problem is that the k-means clustering method I used assigns different cluster numbers each time I run it.
So I need a check that looks at, for example, the allele 1 call and verifies that the numbers are all the same under the cluster column (a call could be listed as cluster 2 in one run and change to cluster 3 when I re-run the code).
Please see data structure below
structure(list(RFU1 = c(-0.295502405, 0.964070798, 3381.332182,
3532.769062, 3431.836843, 3242.966511, 2104.791167, 2220.008503,
3548.252161, 3506.51418, 2290.273178, 2281.587684, -5.64819475,
-11.73109864, 3784.914039, 3619.00781, 3618.211608, 3248.106466,
3394.650325, 3339.870196, 2449.202902, 2426.835174, 3432.153478,
2761.392304, -9.267907504, -7.365704356, 3743.092314, 3787.241702,
2172.027787, 2096.845649, 2135.649551, 2149.145547, 2293.757257,
2348.099108, 2321.019045, 2022.168867, -17.93532331, -12.59832941,
3805.416768, 3498.998138, 2304.597239, 2509.63987, 2181.11547,
2261.011876, 3432.453036, 3662.758933, 2371.11049, 3068.827061,
2612.107589, 2687.824075, 3179.315918, 3688.525218, 3465.327523,
3405.154043, 2535.514915, 2452.200079, 374.435299, 423.6015308,
3742.515563, 3578.777925, 2634.955017, 2527.514043, 3817.579252,
3550.999412, -10.72035816, 3294.486334, 3352.40368, 3463.150507,
3472.576514, 3741.898759, 3571.369947, 3720.645869, 3739.569593,
3855.583168, 418.6837047, 49.47548241, 2171.034284, 2155.314713,
3432.363384, 3582.508917, 3425.415274, 3487.203299, 3505.23909,
3413.342096, 113.5100691, 128.6414053, 2454.588175, 2323.061591,
3188.705702, 3376.950546, 3291.072437, 3181.001961, 3195.013863,
3776.919197, 2284.22659, 2277.338631), RFU2 = c(-8.346468029,
235.4058561, 637.9218251, 650.3759507, 617.4161748, 604.0792911,
4270.310727, 4199.615749, 689.863543, 712.6144338, 4274.287194,
4541.168491, -1.626221758, -2.437395631, 802.0941252, 730.5998997,
686.9037384, 625.8245403, 644.3586836, 642.8833044, 4937.691887,
5159.479928, 725.4449756, 573.3910899, -4.006398006, 213.2859144,
739.7910786, 731.0150586, 4408.81923, 4767.533969, 4302.641493,
4325.913445, 4597.47663, 4666.904418, 4800.357526, 4142.535329,
-17.23239968, 178.5311942, 778.305843, 743.1438168, 4214.507094,
4553.703511, 4629.339014, 4459.697405, 661.7299014, 727.1054982,
4553.170272, 5482.231486, 4520.517999, 4737.802036, 643.3599887,
726.4314715, 696.5968338, 697.6099599, 411.8118071, 409.4943424,
5687.32635, 5757.51512, 766.4240193, 779.2403225, 4745.055632,
4582.267792, 749.5679421, 675.8747055, -7.254521898, 628.3467565,
631.116767, 672.7064514, 687.2642132, 718.1192917, 731.785499,
668.3686048, 784.8055727, 791.3155894, 4471.047168, 4501.597841,
4504.670332, 4442.621066, 682.0632225, 706.6204595, 680.5242182,
683.9558692, 684.2909706, 618.6535251, 5727.684954, 6098.485474,
5099.952926, 4779.742057, 571.4303822, 614.9258218, 602.9830491,
651.2847695, 591.8833499, 742.2387568, 4443.376841, 4716.792177
), cluster = c(2L, 2L, 4L, 4L, 4L, 4L, 1L, 1L, 4L, 4L, 1L, 1L,
2L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 4L, 4L, 2L, 2L, 4L, 4L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 4L, 4L, 1L, 1L, 1L, 1L,
4L, 4L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 4L, 4L,
1L, 1L, 4L, 4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 1L, 1L, 4L, 4L, 4L, 4L,
4L, 4L, 1L, 1L)), row.names = c(NA, -96L), class = "data.frame")
[Screenshots comparing the cluster labels from the first and second k-means runs omitted]
Since it is a question of labels, you can always relabel your groups afterwards. I used 4 clusters based on the within-cluster sum of squares criterion:
suppressPackageStartupMessages( library(dplyr) )
dfr <- as_tibble(dfr)
groups <- lapply(1:10, function(ct) kmeans(x = dfr[, c("RFU1", "RFU2")], centers = ct)$tot.withinss)
# plot(unlist(groups))
Method 1 : fix the random seed
# --- Experiment 1
set.seed(123)
groups <- kmeans(x = dfr[,c("RFU1", "RFU2")], centers = 4)
dfr$cluster <- groups$cluster
dfr
#> # A tibble: 96 x 4
#> id RFU1 RFU2 cluster
#> <chr> <dbl> <dbl> <int>
#> 1 1 -0.296 -8.35 4
#> 2 2 0.964 235. 4
#> 3 3 3381. 638. 2
#> 4 4 3533. 650. 2
#> 5 5 3432. 617. 2
#> 6 6 3243. 604. 3
#> 7 7 2105. 4270. 1
#> 8 8 2220. 4200. 1
#> 9 9 3548. 690. 2
#> 10 10 3507. 713. 2
#> # ... with 86 more rows
# --- Experiment 2
set.seed(123)
groups <- kmeans(x = dfr[,c("RFU1", "RFU2")], centers = 4)
dfr$cluster2 <- groups$cluster
dfr
#> # A tibble: 96 x 5
#> id RFU1 RFU2 cluster cluster2
#> <chr> <dbl> <dbl> <int> <int>
#> 1 1 -0.296 -8.35 4 4
#> 2 2 0.964 235. 4 4
#> 3 3 3381. 638. 2 2
#> 4 4 3533. 650. 2 2
#> 5 5 3432. 617. 2 2
#> 6 6 3243. 604. 3 3
#> 7 7 2105. 4270. 1 1
#> 8 8 2220. 4200. 1 1
#> 9 9 3548. 690. 2 2
#> 10 10 3507. 713. 2 2
#> # ... with 86 more rows
Method 2 : labels don't matter, only the groups do, so do nothing
Method 3 : labels do matter
3.1 my data is ordered: label clusters by their centers
lab <- order(order(groups$centers[,1])) # c(2, 4, 3, 1)
# head( cbind(dfr$cluster2, lab[dfr$cluster2] ) )
dfr <- dfr %>%
mutate(label1 = lab[dfr$cluster2] )
3.2 my data has important individuals, label using references
# Suppose individuals 1, 4, 6 and 7 could name the classes
dfr <- dfr %>% tibble::rownames_to_column(var = "id") %>%
mutate(id = paste0("id_", id) )
refs <- tibble(id = dfr$id[c(1, 4, 6, 7)],
cluster = dfr$cluster2[c(1, 4, 6, 7)]
)
dfr %>%
mutate(label2 = refs$id[ c(2, 4, 3, 1)[cluster2] ]
# label2 = letters[1:4][cluster2]
)
#> # A tibble: 96 x 7
#> id RFU1 RFU2 cluster cluster2 label1 label2
#> <chr> <dbl> <dbl> <int> <int> <int> <chr>
#> 1 1 -0.296 -8.35 4 4 1 1
#> 2 2 0.964 235. 4 4 1 1
#> 3 3 3381. 638. 2 2 4 7
#> 4 4 3533. 650. 2 2 4 7
#> 5 5 3432. 617. 2 2 4 7
#> 6 6 3243. 604. 3 3 3 6
#> 7 7 2105. 4270. 1 1 2 4
#> 8 8 2220. 4200. 1 1 2 4
#> 9 9 3548. 690. 2 2 4 7
#> 10 10 3507. 713. 2 2 4 7
#> # ... with 86 more rows
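The order(order(...)) step in 3.1 is worth a standalone illustration (with toy center values, not the data above): it converts each cluster id into the rank of its center, so the labels depend only on the centers, not on the arbitrary ids kmeans hands out.

```r
# order(order(x)) gives the rank of each element of x; here x is the
# vector of cluster centers, so cluster i is relabelled to the rank of
# its center, independent of the arbitrary id kmeans assigned to it.
centers <- c(3500, 150, 2200, 700)   # toy centers for clusters 1..4
lab <- order(order(centers))
lab
#> [1] 4 1 3 2
# cluster 1 (center 3500) -> label 4, cluster 2 (center 150) -> label 1, ...
old <- c(1L, 2L, 2L, 3L, 4L)   # cluster ids from one k-means run
lab[old]                       # stable labels
#> [1] 4 1 1 3 2
```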

Subsetting a dataframe based on summation of rows of a given column

I am dealing with data with three variables (i.e. id, time, gender). It looks like
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-12L)
)
That is, each id has four observations of time and gender. I want to subset this data in R, keeping for each id the rows up to and including the first row at which the running sum of time reaches a value greater than or equal to 25. Notice that for id 2 all observations will be included and for id 3 only the first observation. The expected result would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-8L)
)
Any help on this is highly appreciated.
One option is using lag of cumsum as:
library(dplyr)
df %>% group_by(id,gender) %>%
filter(lag(cumsum(time), default = 0) < 25 )
# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Using data.table: (Updated based on feedback from #Renu)
library(data.table)
setDT(df)
df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
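For completeness, a base-R sketch of the same lagged-cumsum logic, using ave() instead of dplyr or data.table:

```r
df <- data.frame(
  id     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  time   = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
  gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
)
# ave() computes, within each id, the cumulative sum shifted down by one
# (0 for the first row); rows are kept while that running total before
# the current row is still below 25.
lag_cumsum <- ave(df$time, df$id,
                  FUN = function(t) c(0, head(cumsum(t), -1)))
result <- df[lag_cumsum < 25, ]
nrow(result)
#> [1] 8
```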
Another option would be to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE when the cumsum of 'time' is equal to or greater than 25.
Then you can filter for rows where the cumsum of this vector is less or equal then 1, i.e. filter for entries until the first TRUE for each 'id'.
df %>%
group_by(id) %>%
filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups: id [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Can try a dplyr construction:
dt <- df %>% group_by(id) %>%
#sum time within groups
mutate(sum_time = cumsum(time)) %>%
#'select' rows where the running total before this row is still below 25
filter(lag(sum_time, default = 0) < 25) %>%
#exclude sum_time column from the result
select(-sum_time)
