I have a dataset of groups of genes with each gene having a different score. I am looking to calculate the average gene score and average variation/difference of scores between genes per group.
For example my data looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I am looking to add another column giving the average model score per group and a column for the average variation between scores per group.
So far for the average score per group, I am using
group_average_score <- aggregate( Score ~ Group, df, mean )
Although I am struggling to get this added as an additional column in the data.
Then for taking the average variation score per group I've been trying to go from a similar question (Calculate difference between values by group and matched for time) but I'm struggling to adjust this for my data. I've tried:
test <- df %>%
group_by(Group) %>%
mutate(Diff = c(NA, diff(Score)))
But I'm not sure this calculates the average variation across all genes' scores per group. With my real data, the output gives several different values per group when there should be just one.
Expected output should look something like:
Group Gene Score direct_count secondary_count Average_Score Average_Score_Difference
1 AQP11 0.5566507 4 5 0.46160593 0.183650
1 CLNS1A 0.2811747 0 2 0.46160593 0.183650
1 RSF1 0.5469924 3 6 0.46160593 0.183650
2 CFDP1 0.4186066 1 2 ... ...
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I think the Average_Score_Difference is right, but note that I calculated it by hand for the sake of the example (for Group 1, the differences between each pair of genes, summed and divided by 3).
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
Try this solution with dplyr. Note that it averages the differences between consecutive scores within each group; more information about how the last column should be computed would help:
library(dplyr)
#Code
newdf <- df %>%
  group_by(Group) %>%
  mutate(Avg = mean(Score, na.rm = TRUE),
         Diff = c(0, abs(diff(Score))),
         AvgPerc = mean(Diff, na.rm = TRUE))
Output:
# A tibble: 7 x 8
# Groups: Group [3]
Group Gene Score direct_count secondary_count Avg Diff AvgPerc
<int> <chr> <dbl> <int> <int> <dbl> <dbl> <dbl>
1 1 AQP11 0.557 4 5 0.462 0 0.180
2 1 CLNS1A 0.281 0 2 0.462 0.275 0.180
3 1 RSF1 0.547 3 6 0.462 0.266 0.180
4 2 CFDP1 0.419 1 2 0.424 0 0.00545
5 2 CHST6 0.430 1 3 0.424 0.0109 0.00545
6 3 ACE 0.634 1 1 0.634 0 0.000250
7 3 NOS2 0.634 1 1 0.634 0.000500 0.000250
Some data used:
#Data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))
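If "average variation" is meant as the mean of all pairwise absolute differences within a group, as in the hand calculation in the question, a base R sketch could look like the following. The data frame name scores and the helper mean_pairwise_diff are made up for illustration, and only the Group and Score columns are reproduced. This is one possible reading of the requirement, not necessarily the intended one.

```r
# Group and Score columns from the example data (the name 'scores' is an assumption)
scores <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Score = c(0.5566507, 0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345)
)

# Hypothetical helper: mean of all pairwise absolute differences in x.
# dist() on a plain numeric vector returns |x_i - x_j| for every pair i < j.
mean_pairwise_diff <- function(x) {
  if (length(x) < 2) return(NA_real_)
  mean(as.vector(dist(x)))
}

# ave() repeats each group's summary on every row of that group,
# so both results can be attached directly as new columns.
scores$Average_Score <- ave(scores$Score, scores$Group, FUN = mean)
scores$Average_Score_Difference <- ave(scores$Score, scores$Group,
                                       FUN = mean_pairwise_diff)
print(scores)
```

For Group 1 this gives about 0.18365, which matches the hand-calculated value in the question rather than the 0.180 produced by averaging consecutive differences.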
Using data.table
library(data.table)
setDT(df)[, c('Avg', 'Diff') := .(mean(Score, na.rm = TRUE),
c(0, abs(diff(Score)))), Group][, AvgPerc := mean(Diff), Group]
data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))
The data I have looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
4 Gene1 0.7 0 1
4 Gene2 0.61 1 0
4 Gene3 0.62 0 1
I am grouping the genes by the Group column, then selecting the best gene per group based on these conditions:
Select the gene with the highest score if the score difference between the top-scored gene and all others in the group is >0.05.
If the score difference between the top gene and any other gene in the group is <0.05, select the gene with the highest direct_count, choosing only among the genes within <0.05 of the top-scored gene.
If the direct_count is the same, select the gene with the highest secondary_count.
If all counts are the same, select all genes within <0.05 of each other.
Output from example looking like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5 #highest direct_count
2 CHST6 0.4295135 1 3 #highest secondary_count after matching direct_count
3 ACE 0.634 1 1 #ACE and NOS2 have matching counts
3 NOS2 0.6345 1 1
4 Gene1 0.7 0 1 #highest score by >0.05 difference
Currently I try to code this with:
df<- setDT(df)
new_df <- df[,
{d = dist(Score, method = 'manhattan')
if (any(d > 0.05))
ind = which.max(d)
else if (sum(max(direct_count) == direct_count) == 1L)
ind = which.max(direct_count)
else if (sum(max(secondary_count) == secondary_count) == 1L)
ind = which.max(secondary_count)
else
ind = which((outer(direct_count, direct_count, '==') & outer(secondary_count, secondary_count, '=='))[1, ])
.SD[ind]
}
, by = Group]
However, I am struggling to adjust my first else if statement so that it handles my 2nd condition, which should choose only among genes within <0.05 of the top-scored gene. Currently it compares all genes in the group. For example, if the top-scored gene is at 0.7 and other genes at 0.68 satisfy the <0.05 distance requirement, a gene with a score of 0.1 but the largest count columns still gets selected over the top-scored gene.
Essentially, I want conditions 2 to 4 to consider only the genes that are within <0.05 of the top-scored gene per group.
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1","Gene2","Gene3"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.7, 0.62, 0.61), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L, 0L, 0L, 1L)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Edit:
The reason for my question is a problem with one specific group not doing as I expect:
Group Gene Score direct_count secondary_count
1 2 CFDP1 0.5517401 1 62
2 2 CHST6 0.5989186 1 6
3 2 RNU6-758P 0.5644914 0 1
4 2 Gene1 0.5672916 0 1
5 2 TMEM170A 0.6167083 0 2
CHST6 has the highest direct_count of all genes within <0.05 of the top-scored gene in this group, yet Gene1 is being selected.
This second example input data:
structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1",
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502,
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62,
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))
Your final goal can be achieved with two different solutions: one with dplyr and one with data.table.
You don't need any complicated ifelse condition.
Solution
INPUT
dt <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
DPLYR
library(dplyr)
dt %>%
group_by(Group) %>%
filter((max(Score) - Score)<0.05) %>%
slice_max(direct_count, n = 1) %>%
slice_max(secondary_count, n = 1) %>%
ungroup()
#> # A tibble: 4 x 5
#> Group Gene Score direct_count secondary_count
#> <int> <chr> <dbl> <int> <int>
#> 1 1 AQP11 0.557 4 5
#> 2 2 CHST6 0.430 1 3
#> 3 3 ACE 0.634 1 1
#> 4 3 NOS2 0.634 1 1
DATA.TABLE
library(data.table)
dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#> Group Gene Score direct_count secondary_count
#> 1: 1 AQP11 0.5566507 4 5
#> 2: 2 CHST6 0.4295135 1 3
#> 3: 3 ACE 0.6340000 1 1
#> 4: 3 NOS2 0.6345000 1 1
Your EDIT
Related to your specific issues at the end of your question: these two methods select CHST6, as you would expect according to the rules you wrote.
dt <- structure(list(Group = c(2L, 2L, 2L, 2L, 2L),
Gene = c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A"),
Score = c(0.551740109920502, 0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006),
direct_count = c(1, 1, 0, 0, 0),
secondary_count = c(62, 6, 1, 1, 2)),
row.names = c(NA, -5L),
class = c("data.table",
"data.frame"))
########## DPLYR
library(dplyr)
dt %>%
group_by(Group) %>%
filter((max(Score) - Score)<0.05) %>%
slice_max(direct_count, n = 1) %>%
slice_max(secondary_count, n = 1) %>%
ungroup()
#> # A tibble: 1 x 5
#> Group Gene Score direct_count secondary_count
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 2 CHST6 0.599 1 6
########## DATATABLE
library(data.table)
dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#> Group Gene Score direct_count secondary_count
#> 1: 2 CHST6 0.5989186 1 6
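The same three sequential filters can also be sketched in base R, here applied to the ten-row table from the question so that Group 4 is covered too. The names genes and select_best are made up for illustration, and the values are taken from the question's printed table rather than its dput output. Ties are kept at every step, which is what yields both ACE and NOS2 for Group 3.

```r
# Ten-row example table from the question (the name 'genes' is an assumption)
genes <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6",
           "ACE", "NOS2", "Gene1", "Gene2", "Gene3"),
  Score = c(0.5566507, 0.2811747, 0.5469924, 0.4186066, 0.4295135,
            0.634, 0.6345, 0.7, 0.61, 0.62),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper applied to the rows of one group
select_best <- function(g) {
  # 1. keep only genes within 0.05 of the group's top score
  g <- g[max(g$Score) - g$Score < 0.05, ]
  # 2. among those, keep the highest direct_count (ties survive)
  g <- g[g$direct_count == max(g$direct_count), ]
  # 3. among those, keep the highest secondary_count (ties survive)
  g[g$secondary_count == max(g$secondary_count), ]
}

best <- do.call(rbind, lapply(split(genes, genes$Group), select_best))
rownames(best) <- NULL
print(best)
```

Because the score filter runs first, a low-scoring gene with large counts never reaches the count comparisons.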
I am a novice trying to analyze trap catch data in R and am looking for an efficient way to loop through by trap line. The first column is trap ID. The second column is the trap line that each trap is associated with. The remaining columns are values related to target catch and bycatch for each visit to the traps. I want to write code that will evaluate the data during each visit for each trap line. Here is an example of data I am working with:
Sample Data:
Data <- structure(list(Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L), Trapline = c("Cemetery",
"Cemetery", "Golf", "Church", "Church", "Church"), Target_Visit_1 = c(0L,
1L, 5L, 0L, 1L, 1L), Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L,
4L), Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L), Bycatch_Visit_2 = c(4L,
2L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
The number of traps per trapline varies. I have a code that I wrote out for each Trapline (there are 14 different traplines), but I was hoping there would be a way to consolidate it into one line of code that would calculate values while the trapline was constant, and then when it changed to the next trapline it would start a new calculation. Here is an example of how I was finding the sum of bycatch found at the Cemetery Trapline for visit 1.
CemetaryBycatch1 <- Data %>% filter(Trapline == "Cemetery") %>% select(Bycatch_Visit_1)
sum(CemetaryBycatch1)
As of right now I have code like this written out for each trapline for each visit, but with 14 traplines and 8 total visits, I would like to avoid having to write out so many lines of code and was hoping there was a way to loop through it with one block of code that would calculate value (sum, mean, etc.) for each trap line.
Thanks
Does something like this help you?
You can add a filter for Trapline in between group_by and summarise_all.
Code:
library(dplyr)
Data <- structure(list(Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L), Trapline = c("Cemetery",
"Cemetery", "Golf", "Church", "Church", "Church"), Target_Visit_1 = c(0L,
1L, 5L, 0L, 1L, 1L), Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L,
4L), Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L), Bycatch_Visit_2 = c(4L,
2L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-6L))
Data %>%
group_by(Trap_ID, Trapline) %>%
summarise_all(list(sum))
Output:
#> # A tibble: 6 x 6
#> # Groups: Trap_ID [3]
#> Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> <int> <chr> <int> <int> <int> <int>
#> 1 1 Cemetery 0 3 1 4
#> 2 1 Church 0 2 0 0
#> 3 1 Golf 5 0 2 1
#> 4 2 Cemetery 1 2 1 2
#> 5 2 Church 1 1 1 1
#> 6 3 Church 1 4 0 0
Created on 2020-10-16 by the reprex package (v0.3.0)
Adding another row to Data:
Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
1 Cemetery 100 200 1 4
Will give you:
#> # A tibble: 6 x 6
#> # Groups: Trap_ID [3]
#> Trap_ID Trapline Target_Visit_1 Bycatch_Visit_1 Target_Visit_2 Bycatch_Visit_2
#> <int> <chr> <int> <int> <int> <int>
#> 1 1 Cemetery 100 203 2 8
#> 2 1 Church 0 2 0 0
#> 3 1 Golf 5 0 2 1
#> 4 2 Cemetery 1 2 1 2
#> 5 2 Church 1 1 1 1
#> 6 3 Church 1 4 0 0
Created on 2020-10-16 by the reprex package (v0.3.0)
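If one total per trapline is wanted instead of one per trap, a base R sketch could drop Trap_ID and group on Trapline alone. The names traps and by_line are made up for illustration; the data are the six sample rows from the question.

```r
# Sample data from the question (the name 'traps' is an assumption)
traps <- data.frame(
  Trap_ID = c(1L, 2L, 1L, 1L, 2L, 3L),
  Trapline = c("Cemetery", "Cemetery", "Golf", "Church", "Church", "Church"),
  Target_Visit_1 = c(0L, 1L, 5L, 0L, 1L, 1L),
  Bycatch_Visit_1 = c(3L, 2L, 0L, 2L, 1L, 4L),
  Target_Visit_2 = c(1L, 1L, 2L, 0L, 1L, 0L),
  Bycatch_Visit_2 = c(4L, 2L, 1L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Drop Trap_ID (column 1), then sum every remaining column within each Trapline.
# The "." on the left-hand side stands for all columns other than Trapline.
by_line <- aggregate(. ~ Trapline, data = traps[, -1], FUN = sum)
print(by_line)
```

Swapping FUN = sum for FUN = mean gives per-trapline averages, and new visit columns are picked up without changing the call.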
I have a data frame with the columns sex (male & female), age (child & adult), survive (yes & no) and freq (frequency). How can I create a cross tab of sex and age?
sex age survive freq
male child yes 4
male adult yes 0
female child yes 6
female adult yes 3
male child no 1
male adult no 0
female child no 2
female adult no 1
I think you are looking for reshaping your data using pivot_wider from tidyr:
library(tidyr)
df %>% pivot_wider(., names_from = age, values_from = freq)
# A tibble: 4 x 4
sex survive child adult
<fct> <fct> <int> <int>
1 male yes 4 0
2 female yes 6 3
3 male no 1 0
4 female no 2 1
or
library(tidyr)
df %>% pivot_wider(., names_from = c(age, survive), values_from = freq)
# A tibble: 2 x 5
sex child_yes adult_yes child_no adult_no
<fct> <int> <int> <int> <int>
1 male 4 0 1 0
2 female 6 3 2 1
Is this what you are looking for? If not, can you provide the expected outcome?
Data
df = structure(list(sex = structure(c(2L, 2L, 1L, 1L, 2L, 2L, 1L,
1L), .Label = c("female", "male"), class = "factor"), age = structure(c(2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("adult", "child"), class = "factor"),
survive = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("no",
"yes"), class = "factor"), freq = c(4L, 0L, 6L, 3L, 1L, 0L,
2L, 1L)), class = "data.frame", row.names = c(NA, -8L))
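If a plain sex by age contingency table is what is wanted, with the frequencies summed over survive, a base R sketch with xtabs() might be enough. The names survival and tab are made up for illustration, and whether summing over survive is appropriate is an assumption about the intended result.

```r
# Example data from the question (the name 'survival' is an assumption)
survival <- data.frame(
  sex = c("male", "male", "female", "female", "male", "male", "female", "female"),
  age = c("child", "adult", "child", "adult", "child", "adult", "child", "adult"),
  survive = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  freq = c(4L, 0L, 6L, 3L, 1L, 0L, 2L, 1L),
  stringsAsFactors = FALSE
)

# freq on the left-hand side is used as the cell count;
# rows of the result are sex, columns are age
tab <- xtabs(freq ~ sex + age, data = survival)
print(tab)

# Adding survive to the formula keeps it as a third dimension instead
tab3 <- xtabs(freq ~ sex + age + survive, data = survival)
```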
I have data with a status column. I want to subset my data to the rows with 'f' status, plus the row immediately before each 'f' row within the same id.
To simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?
Here's a solution using dplyr -
library(dplyr)
df %>%
group_by(id) %>%
filter(status == "f" | lead(status) == "f") %>%
ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))
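For comparison, the same "f rows plus the row before each f" logic can be sketched in base R without dplyr. The names visits, is_f, next_is_f and result are made up for illustration.

```r
# Example data from the question (the name 'visits' is an assumption)
visits <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  status = c("n", "n", "f", "n", "f", "n", "n", "n", "f", "f"),
  time = c(1L, 2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)

is_f <- visits$status == "f"

# Within each id, shift is_f up by one position so that each row
# learns whether the following row is an "f"; the last row of an id gets FALSE
next_is_f <- ave(is_f, visits$id, FUN = function(x) c(x[-1], FALSE))

result <- visits[is_f | next_is_f, ]
print(result)
```

Padding with FALSE rather than NA means the last row of each id needs no special handling in the subset.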
I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of each view. So for genotype A on day 1, (1496 * 1704 * 1738)^(1/3). Final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Have been going round and round with reshape2 for the last couple of days, but not getting anywhere. Help appreciated!
I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast from reshape2:
library(reshape2)
dcast(data = long, Day + Genotype ~ .,
      value.var = "variable", fun.aggregate = function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
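Another solution, without additional packages. Using 1/length(x) as the exponent takes the n-th root for however many views a group has, which is the cube root here:
aggregate(list(Summary = long$variable),
          by = list(Day = long$Day, Genotype = long$Genotype),
          function(x) prod(x)^(1/length(x)))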
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790
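The n-th root of a product is the geometric mean, which can also be computed on the log scale. The sketch below assumes all values are strictly positive; the helper name geo_mean and the name views are made up for illustration. With only three views per group the result is the same as prod(x)^(1/3), but the log form avoids very large intermediate products when there are many observations.

```r
# Example data from the question (the name 'views' is an assumption)
views <- data.frame(
  Day = factor(rep(c("1", "2"), each = 6)),
  Genotype = factor(rep(rep(c("A", "B"), each = 3), times = 2)),
  View = factor(rep(c("1", "2", "3"), times = 4)),
  variable = c(1496L, 1704L, 1738L, 1553L, 1834L, 1421L,
               1208L, 1845L, 1325L, 1264L, 1920L, 1735L)
)

# Hypothetical helper: geometric mean on the log scale (x must be > 0)
geo_mean <- function(x) exp(mean(log(x)))

res <- aggregate(variable ~ Day + Genotype, data = views, FUN = geo_mean)
names(res)[names(res) == "variable"] <- "summary"
print(res)
```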