How to conditionally select top value per group without comparing each value? - r

The data I have looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
4 Gene1 0.7 0 1
4 Gene2 0.61 1 0
4 Gene3 0.62 0 1
I am grouping the genes by the Group column and then selecting the best gene per group based on these conditions:
1. Select the gene with the highest score if the score difference between the top-scored gene and all others in the group is >0.05.
2. If the score difference between the top gene and any other gene in a group is <0.05, select the gene with the higher direct_count, choosing only among the genes within 0.05 of the top-scored gene in that group.
3. If the direct_count is tied, select the gene with the highest secondary_count.
4. If all counts are tied, select all genes within 0.05 of each other.
Output from example looking like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5 #highest direct_count
2 CHST6 0.4295135 1 3 #highest secondary_count after matching direct_count
3 ACE 0.634 1 1 #ACE and NOS2 have matching counts
3 NOS2 0.6345 1 1
4 Gene1 0.7 0 1 #highest score by >0.05 difference
Currently I am trying to code this with:
df <- setDT(df)
new_df <- df[,
  {
    d = dist(Score, method = 'manhattan')
    if (any(d > 0.05))
      ind = which.max(d)
    else if (sum(max(direct_count) == direct_count) == 1L)
      ind = which.max(direct_count)
    else if (sum(max(secondary_count) == secondary_count) == 1L)
      ind = which.max(secondary_count)
    else
      ind = which((outer(direct_count, direct_count, '==') & outer(secondary_count, secondary_count, '=='))[1, ])
    .SD[ind]
  }
  , by = Group]
However, I am struggling to adjust my first else if statement to account for my 2nd condition, i.e. only selecting among genes within 0.05 of the top-scored gene. Currently the comparison is made across all genes in the group, so once the tie-break is triggered (for example because another gene at 0.68 is within 0.05 of a top gene at 0.7), a gene with a score of only 0.1 but the largest count columns can end up being selected over the top-scored gene.
Essentially I want conditions 2 to 4 to consider only the genes within 0.05 of the top-scored gene in each group.
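To make the intent concrete, the per-group logic I'm after is roughly this sketch (pick_best is just an illustrative name, not code I currently have):
pick_best <- function(sd) {
  # sd is the per-group .SD (a data.table)
  cand <- sd[Score > max(Score) - 0.05]                   # keep only genes within 0.05 of the top score
  cand <- cand[direct_count == max(direct_count)]         # condition 2: highest direct_count among those
  cand <- cand[secondary_count == max(secondary_count)]   # condition 3: then highest secondary_count
  cand                                                    # condition 4: any remaining ties are all kept
}
# intended usage: df[, pick_best(.SD), by = Group]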
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1","Gene2","Gene3"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.7, 0.62, 0.61), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L, 0L, 1L, 0L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L, 0L, 0L, 1L)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Edit:
The reason for my question is a problem with one specific group not behaving as I expect:
Group Gene Score direct_count secondary_count
1 2 CFDP1 0.5517401 1 62
2 2 CHST6 0.5989186 1 6
3 2 RNU6-758P 0.5644914 0 1
4 2 Gene1 0.5672916 0 1
5 2 TMEM170A 0.6167083 0 2
CHST6 has the highest direct_count out of all genes within 0.05 of the top-scored gene in this group, yet Gene1 is being selected.
This second example input data:
structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1",
"CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502,
0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62,
6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table",
"data.frame"))

Your final goal can be achieved with two different solutions: one with dplyr and one with data.table.
You don't need any complicated if/else conditions.
Solution
INPUT
dt <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
DPLYR
library(dplyr)
dt %>%
  group_by(Group) %>%
  filter((max(Score) - Score) < 0.05) %>%
  slice_max(direct_count, n = 1) %>%
  slice_max(secondary_count, n = 1) %>%
  ungroup()
#> # A tibble: 4 x 5
#> Group Gene Score direct_count secondary_count
#> <int> <chr> <dbl> <int> <int>
#> 1 1 AQP11 0.557 4 5
#> 2 2 CHST6 0.430 1 3
#> 3 3 ACE 0.634 1 1
#> 4 3 NOS2 0.634 1 1
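A note on why both ACE and NOS2 survive for group 3: slice_max() keeps ties by default (with_ties = TRUE), so every row tied on the maximum is returned. A minimal illustration:
library(dplyr)
tibble(x = c(1, 3, 3)) %>%
  slice_max(x, n = 1)
# returns both rows where x == 3; pass with_ties = FALSE to keep exactly one row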
DATA.TABLE
library(data.table)
dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#> Group Gene Score direct_count secondary_count
#> 1: 1 AQP11 0.5566507 4 5
#> 2: 2 CHST6 0.4295135 1 3
#> 3: 3 ACE 0.6340000 1 1
#> 4: 3 NOS2 0.6345000 1 1
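If you prefer a single expression, the same three passes can be chained (an equivalent sketch of the code above, not a different method):
library(data.table)
dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1][
  , .SD[direct_count == max(direct_count)][secondary_count == max(secondary_count)],
  by = Group]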
Your EDIT
Related to the specific issue at the end of your question: both methods select CHST6, as you would expect from the rules you wrote.
dt <- structure(list(Group = c(2L, 2L, 2L, 2L, 2L),
Gene = c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A"),
Score = c(0.551740109920502, 0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006),
direct_count = c(1, 1, 0, 0, 0),
secondary_count = c(62, 6, 1, 1, 2)),
row.names = c(NA, -5L),
class = c("data.table",
"data.frame"))
########## DPLYR
library(dplyr)
dt %>%
  group_by(Group) %>%
  filter((max(Score) - Score) < 0.05) %>%
  slice_max(direct_count, n = 1) %>%
  slice_max(secondary_count, n = 1) %>%
  ungroup()
#> # A tibble: 1 x 5
#> Group Gene Score direct_count secondary_count
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 2 CHST6 0.599 1 6
########## DATATABLE
library(data.table)
dt <- dt[dt[, .I[(max(Score) - Score) < 0.05], by = Group]$V1]
dt <- dt[dt[, .I[direct_count == max(direct_count)], by = Group]$V1]
dt <- dt[dt[, .I[secondary_count == max(secondary_count)], by = Group]$V1]
dt
#> Group Gene Score direct_count secondary_count
#> 1: 2 CHST6 0.5989186 1 6

Related

How to show percentage up to the current observation?

I am working with the following data frame:
Group Indicator
1 1
1 0
1 1
1 1
1 0
2 0
2 0
2 0
2 1
2 0
I am wondering how I can create a new column which shows the percentage of the indicator column for all previous observations within the group. So the above data frame would become:
Group Indicator IndicatorPercent
1 1 NA
1 0 1
1 1 0.5
1 1 0.67
1 0 0.75
2 0 NA
2 0 0
2 0 0
2 1 0
2 0 0.25
Basically, the new column just indicates the percentage (in decimal form) of the indicator up to that point within the group. It just divides the sum of the indicator column up to that point by the count of previous observations within the group.
My first thought was to use group_by along with row_number in order to reference previous observations, but I couldn't figure out how to make it work.
Data:
structure(list(Group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), Indicator = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L), IndicatorPercent = c(NA,
1, 0.5, 0.67, 0.75, NA, 0, 0, 0, 0.25)), class = "data.frame", row.names = c(NA,
-10L))
We get the cummean of 'Indicator' after grouping by 'Group' and then take the lag of it.
library(dplyr)
df1 %>%
  group_by(Group) %>%
  mutate(IndicatorPercent = lag(cummean(Indicator))) %>%
  ungroup
-output
# A tibble: 10 x 3
# Group Indicator IndicatorPercent
# <int> <int> <dbl>
# 1 1 1 NA
# 2 1 0 1
# 3 1 1 0.5
# 4 1 1 0.667
# 5 1 0 0.75
# 6 2 0 NA
# 7 2 0 0
# 8 2 0 0
# 9 2 1 0
#10 2 0 0.25
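For reference, cummean(x) is equivalent to cumsum(x)/seq_along(x), so if you prefer to see the arithmetic spelled out, the same lagged running mean can be written as (a sketch):
library(dplyr)
df1 %>%
  group_by(Group) %>%
  mutate(IndicatorPercent = lag(cumsum(Indicator) / row_number())) %>%
  ungroup()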
If we want to do this based on the value of another column, use replace:
library(tidyr)
df1 %>%
  group_by(Group) %>%
  mutate(IndicatorPercent = replace(rep(NA_real_, n()),
                                    color == 'red',
                                    lag(cummean(Indicator[color == "red"])))) %>%
  fill(IndicatorPercent) %>%
  ungroup
Or with data.table
library(data.table)
setDT(df1)[color == 'red',
IndicatorPercent := shift(cummean(Indicator)), Group][,
IndicatorPercent := nafill(IndicatorPercent, type = 'locf'), Group][]

How to calculate average variation per group in R?

I have a dataset of groups of genes with each gene having a different score. I am looking to calculate the average gene score and average variation/difference of scores between genes per group.
For example my data looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I am looking to add another column giving the average score per group and a column for the average variation between scores per group.
So far for the average score per group, I am using
group_average_score <- aggregate( Score ~ Group, df, mean )
However, I am struggling to get this added as an additional column in the data.
Then, for the average variation of scores per group, I've been trying to adapt a similar question (Calculate difference between values by group and matched for time) but I'm struggling to adjust it for my data. I've tried:
test <- df %>%
  group_by(Group) %>%
  mutate(Diff = c(NA, diff(Score)))
But I'm not sure this is calculating the average variation across all genes' scores per group. The output using my real data gives a couple of different average variation values per group when there should be just one.
Expected output should look something like:
Group Gene Score direct_count secondary_count Average_Score Average_Score_Difference
1 AQP11 0.5566507 4 5 0.46160593 0.183650
1 CLNS1A 0.2811747 0 2 0.46160593 0.183650
1 RSF1 0.5469924 3 6 0.46160593 0.183650
2 CFDP1 0.4186066 1 2 ... ...
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I think the Average_Score_Difference is fine, but note that I've calculated it by hand for the sake of the example (the differences each gene has with every other gene in Group 1, summed and divided by 3).
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
Try this solution with dplyr, though more info about how to compute the last column should be provided:
library(dplyr)
#Code
newdf <- df %>%
  group_by(Group) %>%
  mutate(Avg = mean(Score, na.rm = TRUE),
         Diff = c(0, abs(diff(Score))),
         AvgPerc = mean(Diff, na.rm = TRUE))
Output:
# A tibble: 7 x 8
# Groups: Group [3]
Group Gene Score direct_count secondary_count Avg Diff AvgPerc
<int> <chr> <dbl> <int> <int> <dbl> <dbl> <dbl>
1 1 AQP11 0.557 4 5 0.462 0 0.180
2 1 CLNS1A 0.281 0 2 0.462 0.275 0.180
3 1 RSF1 0.547 3 6 0.462 0.266 0.180
4 2 CFDP1 0.419 1 2 0.424 0 0.00545
5 2 CHST6 0.430 1 3 0.424 0.0109 0.00545
6 3 ACE 0.634 1 1 0.634 0 0.000250
7 3 NOS2 0.634 1 1 0.634 0.000500 0.000250
Some data used:
#Data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))
Using data.table
library(data.table)
setDT(df)[, c('Avg', 'Diff') := .(mean(Score, na.rm = TRUE),
c(0, abs(diff(Score)))), Group][, AvgPerc := mean(Diff), Group]
data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))
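If the intended "average variation" is the mean of all pairwise absolute score differences within a group (which is what the hand-worked 0.183650 for Group 1 corresponds to), dist() gives it directly. A sketch under that assumption:
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(Average_Score = mean(Score, na.rm = TRUE),
         # mean of all pairwise absolute differences of Score within the group
         Average_Score_Difference = mean(as.numeric(dist(Score, method = "manhattan")))) %>%
  ungroup()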

Matching the previous row in a specific column and performing a calculation in R

I currently have a data file that resembles this:
R ID A B
1 A1 0 0
2 A1 2 4
3 A1 4 8
4 A2 0 0
5 A2 3 3
6 A2 6 6
I would like to write a script that will calculate "(8-4)/(4-2)" from the previous row only if the "ID" matches. For example, for a new column "C" in row 3: since A1 == A1 in the "ID" column, C = (8-4)/(4-2) = 2. If the IDs don't match, then the output is 0.
I would like the output to be like this:
R ID A B C
1 A1 0 0 0
2 A1 2 4 2
3 A1 4 8 2
4 A2 0 0 0
5 A2 3 3 1
6 A2 6 6 1
Hopefully I explained this correctly in a non-confusing manner.
We could group_by ID, use diff to calculate difference between rows and divide.
library(dplyr)
df %>% group_by(ID) %>% mutate(C = c(0, diff(B)/diff(A)))
# R ID A B C
# <int> <fct> <int> <int> <dbl>
#1 1 A1 0 0 0
#2 2 A1 2 4 2
#3 3 A1 4 8 2
#4 4 A2 0 0 0
#5 5 A2 3 3 1
#6 6 A2 6 6 1
and similarly using data.table
library(data.table)
setDT(df)[, C := c(0, diff(B)/diff(A)), ID]
data
df <- structure(list(R = 1:6, ID = structure(c(1L, 1L, 1L, 2L, 2L,
2L), .Label = c("A1", "A2"), class = "factor"), A = c(0L, 2L,
4L, 0L, 3L, 6L), B = c(0L, 4L, 8L, 0L, 3L, 6L)), class = "data.frame",
row.names = c(NA, -6L))
We can also use lag
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(C = (B - lag(B, default = first(B))) / (A - lag(A, default = first(A))))
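One caveat with the lag version: on the first row of each group the defaults make both numerator and denominator 0, so C is 0/0 = NaN rather than 0. A small adjustment (a sketch) if you need an exact 0 there:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(C = (B - lag(B, default = first(B))) / (A - lag(A, default = first(A))),
         C = if_else(is.nan(C), 0, C)) %>%   # map the 0/0 first-row NaN back to 0
  ungroup()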

Using R to manipulate dataframe: each row of a column to separate columns

I have a longer dataframe with student name, subjects, question names and their marks. A short version of this dataframe looks like the below:
c1 c2 c3 c4
A 1 a 1
A 1 b 0.5
A 1 c 1
A 2 a 2
A 2 b 1.5
A 2 c 3
A 2 d 3
B 1 a 0
B 1 b 1.5
B 1 c 2
B 2 a 2
B 2 b 2.5
B 2 c 4
B 2 d 5
Here A and B are students, 1 and 2 are subjects, a, b, c, d are question names, and the last column is the marks obtained by each student in each subject and each question.
I would like to reshape part of the dataframe so that the question name column becomes the headings of several columns holding the corresponding marks, like below:
Student Sub:1 Sub:2
Name a b c a b c d
A 1 0.5 1 2 1.5 3 3
B 0 1.5 2 2 2.5 4 5
I can use the transpose function (t(dataframe)) to switch columns to rows, but I don't know how to do this while retaining the details of students and marks.
Can someone guide me on how I can achieve this?
We can use pivot_wider, taking names from the c2 and c3 columns and values from c4.
tidyr::pivot_wider(df,names_from = c(c2, c3),values_from = c4, names_prefix = "Sub")
# c1 Sub1_a Sub1_b Sub1_c Sub2_a Sub2_b Sub2_c Sub2_d
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 1 0.5 1 2 1.5 3 3
#2 B 0 1.5 2 2 2.5 4 5
Or with data.table, dcast
data.table::dcast(df, c1~paste0("Sub",c2)+c3, value.var = "c4")
data
df <- structure(list(c1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
c2 = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), c3 = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L,
3L, 1L, 2L, 3L, 4L), .Label = c("a", "b", "c", "d"), class = "factor"),
c4 = c(1, 0.5, 1, 2, 1.5, 3, 3, 0, 1.5, 2, 2, 2.5, 4, 5)), class = "data.frame",
row.names = c(NA, -14L))
Here is base R solution using reshape()
df$namecol <- with(df,paste0(c2,c3))
dfout <- reshape(df[-c(2:3)],direction = "wide",idvar = "c1",timevar = "namecol")
names(dfout) <- gsub(".*\\.(.*)","\\1",names(dfout))
such that
> dfout
c1 1a 1b 1c 2a 2b 2c 2d
1 A 1 0.5 1 2 1.5 3 3
8 B 0 1.5 2 2 2.5 4 5

Counting occurrences of a variable without taking duplicates into account

I have a big data frame, called data, with 1,004,490 obs, and I want to analyse the success of a treatment.
ID POSITIONS TREATMENT
1 0 A
1 1 A
1 2 B
2 0 C
2 1 D
3 0 B
3 1 B
3 2 C
3 3 A
3 4 A
3 5 B
So firstly, I want to count the number of times each treatment is applied to a patient (ID), but one treatment can be given several times to an ID. Do I need to first delete all the duplicates and then count, or is there a function that doesn't take the duplicates into account?
What I want to have :
A : 2
B : 2
C : 2
D : 1
Then, I want to know how many times each treatment was given at the last position, but the last position differs from ID to ID.
What I want to have :
A : 0
B : 2 (for ID = 1 and 3)
C : 0
D : 1 (for ID = 2)
Thanks for your help, I am a new user of R !
Using base R, we can do,
merge(aggregate(ID ~ TREATMENT, df1, FUN = function(i) length(unique(i))),
      aggregate(ID ~ TREATMENT, df1[!duplicated(df1$ID, fromLast = TRUE), ], toString),
      by = 'TREATMENT', all = TRUE)
Which gives,
TREATMENT ID.x ID.y
1 A 2 <NA>
2 B 2 1, 3
3 C 2 <NA>
4 D 1 2
Here is a tidyverse approach, where we get the distinct rows based on 'ID', 'TREATMENT' and get the count of 'TREATMENT'
library(tidyverse)
df1 %>%
  distinct(ID, TREATMENT) %>%
  count(TREATMENT)
# A tibble: 4 x 2
# TREATMENT n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
And for the second output: after grouping by 'ID', slice the last row (n()), create a column 'ind', fill it with 0 for all missing combinations of 'TREATMENT' using complete, then get the sum of 'ind' after grouping by 'TREATMENT'.
df1 %>%
  group_by(ID) %>%
  slice(n()) %>%
  mutate(ind = 1) %>%
  complete(TREATMENT = unique(df1$TREATMENT), fill = list(ind = 0)) %>%
  group_by(TREATMENT) %>%
  summarise(n = sum(ind))
# A tibble: 4 x 2
# TREATMENT n
# <chr> <dbl>
#1 A 0
#2 B 2
#3 C 0
#4 D 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L), POSITIONS = c(0L, 1L, 2L, 0L, 1L, 0L, 1L, 2L, 3L, 4L, 5L
), TREATMENT = c("A", "A", "B", "C", "D", "B", "B", "C", "A",
"A", "B")), .Names = c("ID", "POSITIONS", "TREATMENT"),
class = "data.frame", row.names = c(NA, -11L))
