Looping through a column in R as variable changes

I am a novice trying to analyze trap catch data in R and am looking for an efficient way to loop through it by trapline. The first column is the trap ID. The second column is the trapline that each trap is associated with. The remaining columns are values related to target catch and bycatch for each visit to the traps. I want to write code that will evaluate the data during each visit for each trapline. Here is an example of the data I am working with:
Sample Data:
Data <- structure(list(Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L), Trapline = c("Cemetery",
"Cemetery", "Golf", "Church", "Church", "Church"), Target_Visit_1 = c(0L,
1L, 5L, 0L, 1L, 1L), Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L,
4L), Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L), Bycatch_Visit_2 = c(4L,
2L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
The number of traps per trapline varies. I have code that I wrote out for each trapline (there are 14 different traplines), but I was hoping there would be a way to consolidate it into one block of code that would calculate values while the trapline was constant, and then start a new calculation when it changed to the next trapline. Here is an example of how I was finding the sum of bycatch found at the Cemetery trapline for visit 1:
CemeteryBycatch1 <- Data %>% filter(Trapline == "Cemetery") %>% select(Bycatch_Visit_1)
sum(CemeteryBycatch1)
As of right now I have code like this written out for each trapline for each visit, but with 14 traplines and 8 total visits, I would like to avoid writing out so many lines of code and was hoping there was a way to loop through it with one block of code that would calculate values (sum, mean, etc.) for each trapline.
Thanks

Does something like this help you?
You can add a filter for Trapline in between group_by and summarise_all.
Code:
library(dplyr)
Data <- structure(list(Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L), Trapline = c("Cemetery",
"Cemetery", "Golf", "Church", "Church", "Church"), Target_Visit_1 = c(0L,
1L, 5L, 0L, 1L, 1L), Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L,
4L), Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L), Bycatch_Visit_2 = c(4L,
2L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Data %>%
  group_by(Trap_ID, Trapline) %>%
  summarise_all(list(sum))
Output:
#> # A tibble: 6 x 6
#> # Groups: Trap_ID [3]
#> Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> <int> <chr> <int> <int> <int> <int>
#> 1 1 Cemetery 0 3 1 4
#> 2 1 Church 0 2 0 0
#> 3 1 Golf 5 0 2 1
#> 4 2 Cemetery 1 2 1 2
#> 5 2 Church 1 1 1 1
#> 6 3 Church 1 4 0 0
Created on 2020-10-16 by the reprex package (v0.3.0)
Adding another row to Data:
Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
1 Cemetery 100 200 1 4
Will give you:
#> # A tibble: 6 x 6
#> # Groups: Trap_ID [3]
#> Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> <int> <chr> <int> <int> <int> <int>
#> 1 1 Cemetery 100 203 2 8
#> 2 1 Church 0 2 0 0
#> 3 1 Golf 5 0 2 1
#> 4 2 Cemetery 1 2 1 2
#> 5 2 Church 1 1 1 1
#> 6 3 Church 1 4 0 0
Created on 2020-10-16 by the reprex package (v0.3.0)
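Since the question ultimately wants one total per trapline rather than per trap, a minimal variation on the same pattern drops Trap_ID before grouping. This is a sketch, not part of the original answer:
library(dplyr)

# One row per trapline: drop the trap identifier, group on Trapline,
# and sum every remaining visit column in one call.
Data %>%
  select(-Trap_ID) %>%
  group_by(Trapline) %>%
  summarise_all(sum)
For the sample data this gives, e.g., a Bycatch_Visit_1 total of 5 for Cemetery (3 + 2), replacing the 14 traplines x 8 visits of hand-written blocks described in the question.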

Related

How to calculate average variation per group in R?

I have a dataset of groups of genes with each gene having a different score. I am looking to calculate the average gene score and average variation/difference of scores between genes per group.
For example my data looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I am looking to add another column giving the average model score per group and a column for the average variation between scores per group.
So far for the average score per group, I am using
group_average_score <- aggregate( Score ~ Group, df, mean )
Although I am struggling to get this added as an additional column in the data.
Then, for the average variation score per group, I've been trying to adapt a similar question (Calculate difference between values by group and matched for time), but I'm struggling to adjust it for my data. I've tried:
test <- df %>%
  group_by(Group) %>%
  mutate(Diff = c(NA, diff(Score)))
But I'm not sure this is calculating the average variation across all genes' scores per group. The output using my real data gives a couple of different variation averages per group when there should be just one.
Expected output should look something like:
Group Gene Score direct_count secondary_count Average_Score Average_Score_Difference
1 AQP11 0.5566507 4 5 0.46160593 0.183650
1 CLNS1A 0.2811747 0 2 0.46160593 0.183650
1 RSF1 0.5469924 3 6 0.46160593 0.183650
2 CFDP1 0.4186066 1 2 ... ...
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I think the Average_Score_Difference is fine but just to note I've done it by hand for sake of example (differences each gene has with each other summed and divided by 3 for Group 1).
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
Try this solution with dplyr, though more info about how to compute the last column should be provided:
library(dplyr)
#Code
newdf <- df %>%
  group_by(Group) %>%
  mutate(Avg = mean(Score, na.rm = TRUE),
         Diff = c(0, abs(diff(Score))),
         AvgPerc = mean(Diff, na.rm = TRUE))
Output:
# A tibble: 7 x 8
# Groups: Group [3]
Group Gene Score direct_count secondary_count Avg Diff AvgPerc
<int> <chr> <dbl> <int> <int> <dbl> <dbl> <dbl>
1 1 AQP11 0.557 4 5 0.462 0 0.180
2 1 CLNS1A 0.281 0 2 0.462 0.275 0.180
3 1 RSF1 0.547 3 6 0.462 0.266 0.180
4 2 CFDP1 0.419 1 2 0.424 0 0.00545
5 2 CHST6 0.430 1 3 0.424 0.0109 0.00545
6 3 ACE 0.634 1 1 0.634 0 0.000250
7 3 NOS2 0.634 1 1 0.634 0.000500 0.000250
Some data used:
#Data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))
Using data.table
library(data.table)
setDT(df)[, c('Avg', 'Diff') := .(mean(Score, na.rm = TRUE),
                                  c(0, abs(diff(Score)))), Group][, AvgPerc := mean(Diff), Group]
data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))
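Note that both answers above measure variation with consecutive differences (diff(Score)), while the hand-computed expected output in the question (0.183650 for Group 1) averages all pairwise differences between genes. If that is the intended definition, a sketch using dist() per group reproduces it:
library(dplyr)

# Mean of all pairwise absolute score differences within each group;
# dist() on a numeric vector returns exactly those pairwise distances.
# For Group 1: (0.2755 + 0.0097 + 0.2658) / 3 = 0.183650, matching the
# expected output in the question.
df %>%
  group_by(Group) %>%
  mutate(Average_Score = mean(Score),
         Average_Score_Difference = mean(as.numeric(dist(Score)))) %>%
  ungroup()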

R aggregate column until one condition is met

so I have a dataframe of this form:
ID Var1 Var2
1 1 1
1 2 2
1 3 3
1 4 2
1 5 2
2 1 4
2 2 8
2 3 10
2 4 10
2 5 7
and I would like to filter the Var1 values by group for their maximum, on the condition that the maximum value of Var2 has not yet been reached. This will be part of a new dataframe containing only one row per ID, so the outcome should be something like this:
ID Var1
1 2
2 2
So the function should filter the dataframe for the maximum, but only consider the values in the rows before Var2 reaches its maximum. The row containing the maximum itself should not be included, and neither should the rows after it.
I tried building something with a while loop, but it didn't work out. I'd also be thankful if the solution doesn't employ data.table.
Thanks in advance
Maybe you could do something like this:
DF <- structure(list(
ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
Var2 = c(1L, 2L, 3L, 2L, 2L, 4L, 8L, 10L, 10L, 7L)),
class = "data.frame", row.names = c(NA, -10L))
library(dplyr)
DF %>%
  group_by(ID) %>%
  slice(1:(which.max(Var2) - 1)) %>%
  slice_max(Var1) %>%
  select(ID, Var1)
#> # A tibble: 2 x 2
#> # Groups: ID [2]
#> ID Var1
#> <int> <int>
#> 1 1 2
#> 2 2 2
Created on 2020-08-04 by the reprex package (v0.3.0)
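For completeness, a base R sketch of the same idea, assuming (as in the example data) that the Var2 maximum never sits in the first row of a group, in which case the subset before it would be empty:
# Within each ID, keep only the rows before the first maximum of Var2,
# then report the largest Var1 among them.
do.call(rbind, lapply(split(DF, DF$ID), function(d) {
  before <- d[seq_len(which.max(d$Var2) - 1), ]
  data.frame(ID = d$ID[1], Var1 = max(before$Var1))
}))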

Group by partial string matches

I have a table with a list of categories, each with a count value, that I'd like to collapse based on similarity ... for example, Mariner-1_Amel and Mariner-10 would become a single category, Mariner, and anything with 'Jockey' or 'hAT' in the name should be collapsed together.
I'm struggling to find a solution that can cope with all the possibilities. Is there an easy dplyr solution?
reproducible with
> dput(tibs)
structure(list(type = c("(TTAAG)n_1", "AMARI_1", "Copia-4_LH-I",
"DNA", "DNA-1_CQ", "DNA/hAT-Charlie", "DNA/hAT-Tip100", "DNA/MULE-MuDR",
"DNA/P", "DNA/PiggyBac", "DNA/TcMar-Mariner", "DNA/TcMar-Tc1",
"DNA/TcMar-Tigger", "G3_DM", "Gypsy-10_CFl-I", "hAT-1_DAn", "hAT-16_SM",
"hAT-N4_RPr", "HELITRON7_CB", "Jockey-1_DAn", "Jockey-1_DEl",
"Jockey-12_DF", "Jockey-5_DTa", "Jockey-6_DYa", "Jockey-6_Hmel",
"Jockey-7_HMM", "Jockey-8_Hmel", "LINE/Dong-R4", "LINE/I", "LINE/I-Jockey",
"LINE/I-Nimb", "LINE/Jockey", "LINE/L1", "LINE/L2", "LINE/R1",
"LINE/R2", "LINE/R2-NeSL", "LINE/Tad1", "LTR/Gypsy", "Mariner_CA",
"Mariner-1_AMel", "Mariner-10_HSal", "Mariner-13_ACe", "Mariner-15_HSal",
"Mariner-16_DAn", "Mariner-19_RPr", "Mariner-30_SM", "Mariner-39_SM",
"Mariner-42_HSal", "Mariner-46_HSal", "Mariner-49_HSal", "TE-5_EL",
"Unknown", "Utopia-1_Crp"), n = c(1L, 1L, 1L, 2L, 1L, 18L, 3L,
9L, 2L, 8L, 21L, 12L, 18L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 1L, 2L, 7L, 2L, 7L, 24L, 1L, 1L, 5L, 3L, 1L,
1L, 7L, 1L, 5L, 1L, 1L, 5L, 5L, 1L, 1L, 3L, 5L, 5L, 2L, 1L, 190L,
1L)), row.names = c(NA, -54L), class = c("tbl_df", "tbl", "data.frame"
))
It seems to me that your broader types mostly (or entirely) sit at the beginning of the string. You could therefore use just the first alphanumeric sequence ([[:alnum:]]+) of each type as its broader type. This would give you the following types:
library(tidyverse)
tibs %>%
  mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
  count(type_short, sort = TRUE)
#> # A tibble: 15 x 2
#> type_short n
#> <chr> <int>
#> 1 Mariner 12
#> 2 LINE 11
#> 3 DNA 10
#> 4 Jockey 8
#> 5 hAT 3
#> 6 AMARI 1
#> 7 Copia 1
#> 8 G3 1
#> 9 Gypsy 1
#> 10 HELITRON7 1
#> 11 LTR 1
#> 12 TE 1
#> 13 TTAAG 1
#> 14 Unknown 1
#> 15 Utopia 1
You can easily use the new column to group_by:
tibs %>%
  mutate(type_short = str_extract(type, "[[:alnum:]]+")) %>%
  group_by(type_short) %>%
  summarise(n = sum(n))
#> # A tibble: 15 x 2
#> type_short n
#> <chr> <int>
#> 1 AMARI 1
#> 2 Copia 1
#> 3 DNA 94
#> 4 G3 1
#> 5 Gypsy 3
#> 6 hAT 5
#> 7 HELITRON7 1
#> 8 Jockey 10
#> 9 LINE 54
#> 10 LTR 7
#> 11 Mariner 35
#> 12 TE 1
#> 13 TTAAG 1
#> 14 Unknown 190
#> 15 Utopia 1
Theoretically, you could also try string similarity here, although your types do not have great similarity among themselves. A relative Levenshtein distance (distance divided by the number of characters of the longer string), for example, gives results like this:
strings <- c("Mariner-1_Amel", "Mariner-10")
adist(strings) / max(nchar(strings))
#> [,1] [,2]
#> [1,] 0.0000000 0.3571429
#> [2,] 0.3571429 0.0000000
This could be read as the two strings differing by about 36%. Finding a good threshold might be hard in that case.
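If you did want to pursue the similarity route anyway, a hypothetical sketch (the 0.5 cutoff is an arbitrary assumption) could cluster the type strings on relative edit distance:
# Cluster type strings by relative Levenshtein distance, then cut the
# tree at an arbitrary height to form groups.
types <- c("Mariner-1_AMel", "Mariner-10_HSal", "Jockey-1_DAn", "Jockey-5_DTa")
d <- adist(types) / outer(nchar(types), nchar(types), pmax)
groups <- cutree(hclust(as.dist(d)), h = 0.5)
split(types, groups)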
This solution uses dplyr's case_when() together with base R's grepl().
library(dplyr)
tibs %>%
  mutate(category = case_when(
    grepl("hAT|Jockey", type) ~ "Jockey",
    grepl("Mariner", type) ~ "Mariner",
    grepl("DNA", type) ~ "DNA",
    grepl("LINE", type) ~ "LINE",
    TRUE ~ as.character(type)
  ),
  category = factor(category)
  )
If there is no commonality to define the groups you can define individual conditions using case_when.
library(dplyr)
library(stringr)
tibs %>%
  mutate(category = case_when(str_detect(type, 'Mariner-\\d+') ~ 'Mariner',
                              str_detect(type, 'Jockey|hAT') ~ 'common'
                              # Add more conditions
                              ))

Subsetting data based on the condition of the current and previous entity in R

I have data with a status column. I want to subset my data to the rows with status 'f', along with the row immediately preceding each 'f' status.
To simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?
Here's a solution using dplyr -
df %>%
  group_by(id) %>%
  filter(status == "f" | lead(status) == "f") %>%
  ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))

Subsetting a dataframe based on summation of rows of a given column

I am dealing with data with three variables (i.e. id, time, gender). It looks like
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-12L)
)
That is, each id has four observations for time and gender. I want to subset this data in R, keeping rows for each id up to and including the first row at which the running sum of the variable time reaches or exceeds 25. Notice that for id 2 all observations will be included, and for id 3 only the first observation is involved. The expected result would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-8L)
)
Any help on this is highly appreciated.
One option is to use the lag of the cumulative sum:
library(dplyr)
df %>%
  group_by(id, gender) %>%
  filter(lag(cumsum(time), default = 0) < 25)
# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Using data.table (updated based on feedback from @Renu):
library(data.table)
setDT(df)
df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
Another option would be to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE when the cumsum of 'time' is equal to or greater than 25.
Then you can filter for rows where the cumsum of this vector is less than or equal to 1, i.e. keep entries up to and including the first TRUE for each 'id'.
df %>%
  group_by(id) %>%
  filter(cumsum(cumsum(time) >= 25) <= 1)
# A tibble: 8 x 3
# Groups: id [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
You can try a similar dplyr construction:
dt <- df %>%
  group_by(id) %>%
  # sum time within groups
  mutate(sum_time = cumsum(time)) %>%
  # keep rows until the running total first reaches 25
  filter(lag(sum_time, default = 0) < 25) %>%
  # exclude the sum_time column from the result
  select(-sum_time)
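And for reference, a base R sketch of the same lag-of-cumsum idea using ave(), with no packages:
# Lagged running total of time per id; keep rows while it is still below 25.
lagged <- ave(df$time, df$id, FUN = function(x) c(0, head(cumsum(x), -1)))
df[lagged < 25, ]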
