How to count matching groups by a threshold in R [duplicate]

I have a dataset with a list of genes that I have run 2 machine learning models on, so I have 2 sets of predicted scores. I am looking to identify how many genes fall in a similar score range in both groups.
For example my data looks like this:
Gene1 Score1 Gene2 Score2
PPL 0.78 COL8A1 0.78
NPHS2 0.77 ARHGEF25 0.77
EHD4 0.75 C1GALT1 0.77
THBS1 0.74 CEP164 0.76
PRKAA1 0.74 MLLT3 0.76
WNT7A 0.73 PPL 0.76
DVL1 0.72 MRVI1 0.75
TUBGCP4 0.71 BMPR1B 0.75
SARM1 0.71 RAB1A 0.75
VPS4A 0.70 CLTC 0.75
In this example, the only gene that appears in both lists is PPL. I'm trying to write code that pulls this out, e.g. code that returns all genes matching between the 2 lists with a score > 0.75. I want to do this so I can check genes at multiple score thresholds.
I've looked at using code from similarly worded questions, but none have a similar data structure that works with mine. I've tried using filter() and match() but haven't got it working, any help would be appreciated.
Input data:
dput(df)
structure(list(Gene1 = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1",
"WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"), `Score1` = c(0.78,
0.77, 0.75, 0.74, 0.74, 0.73,
0.72, 0.71, 0.71, 0.70), Gene2 = c("COL8A1",
"ARHGEF25", "C1GALT1", "CEP164", "MLLT3", "PPL", "MRVI1", "BMPR1B",
"RAB1A", "CLTC"), `Score2` = c(0.78, 0.77,
0.77, 0.76, 0.76, 0.76, 0.75,
0.75, 0.75, 0.75)), row.names = c(NA, -10L
), class = c("data.table", "data.frame"))

You can join the data frame with itself to get all the genes common to both lists.
library(dplyr)
inner_join(df, df, by = c('Gene1' = 'Gene2')) %>%
select(Gene1, Score1 = Score1.x, Score2 = Score2.y)
# Gene1 Score1 Score2
#1: PPL 0.78 0.76
You can then filter Score1 and Score2 based on some threshold.
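As a rough sketch of that filtering step, repeated over several thresholds as the question asks, the join and the filter can be wrapped in a small function. This version uses base R's merge() instead of dplyr, rebuilds the question's data as a plain data frame, and the names matching_genes and thresholds are hypothetical, chosen only for illustration.

```r
# Question's data, rebuilt as a plain data.frame
df <- data.frame(
  Gene1  = c("PPL", "NPHS2", "EHD4", "THBS1", "PRKAA1",
             "WNT7A", "DVL1", "TUBGCP4", "SARM1", "VPS4A"),
  Score1 = c(0.78, 0.77, 0.75, 0.74, 0.74, 0.73, 0.72, 0.71, 0.71, 0.70),
  Gene2  = c("COL8A1", "ARHGEF25", "C1GALT1", "CEP164", "MLLT3",
             "PPL", "MRVI1", "BMPR1B", "RAB1A", "CLTC"),
  Score2 = c(0.78, 0.77, 0.77, 0.76, 0.76, 0.76, 0.75, 0.75, 0.75, 0.75)
)

# Hypothetical helper: genes present in both lists whose two scores
# both exceed the threshold
matching_genes <- function(data, threshold) {
  merged <- merge(data[, c("Gene1", "Score1")],
                  data[, c("Gene2", "Score2")],
                  by.x = "Gene1", by.y = "Gene2")
  merged[merged$Score1 > threshold & merged$Score2 > threshold, ]
}

# Count the matching genes at each threshold of interest
thresholds <- c(0.70, 0.75, 0.80)
counts <- sapply(thresholds, function(th) nrow(matching_genes(df, th)))
names(counts) <- thresholds
counts
```

Calling matching_genes(df, 0.75) returns the matching rows themselves instead of a count.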

Staying in data.table:
library(data.table)
df1 <- df[,.(Gene1,Score1)]
df2 <- df[,.(Gene2,Score2)]
threshold <- 0.75
df1[df2, on = .(Gene1 = Gene2)][Score1 > threshold & Score2 > threshold]
# Gene1 Score1 Score2
# 1: PPL 0.78 0.76

Related

Remove leading zeros in numbers *within a data frame*

Edit: For anyone coming later: THIS IS NOT A DUPLICATE, since it explicitly concerns work on data frames, not single variables/vectors.
I have found several sites describing how to drop leading zeros in numbers or strings, including vectors. But none of the descriptions I found seem applicable to data frames.
Or the f_num function in the numform package. It treats "[a] vector of numbers (or string equivalents)", but does not seem to solve unwanted leading zeros in a data frame.
I am relatively new to R but understand that I could develop some (in my mind) complex code to drop leading zeros by subsetting vectors from a data frame and then combining those vectors into a full data frame. I would like to avoid that.
Here is a simple data frame:
df <- structure(list(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26,
-0.23), low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3,
-0.28), up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17
)), row.names = c(NA, 8L), class = "data.frame")
Which gives
df
est low2.5 up2.5
1 0.05 0.01 0.09
2 -0.16 -0.20 -0.12
3 -0.02 -0.05 0.00
4 0.00 -0.03 0.04
5 -0.11 -0.20 -0.01
6 0.15 0.10 0.20
7 -0.26 -0.30 -0.22
8 -0.23 -0.28 -0.17
I would want
est low2.5 up2.5
1 .05 .01 .09
2 -.16 -.20 -.12
3 -.02 -.05 .00
4 .00 -.03 .04
5 -.11 -.20 -.01
6 .15 .10 .20
7 -.26 -.30 -.22
8 -.23 -.28 -.17
Is that possible with relatively simple code for a whole data frame?
Edit: An incorrect link has been removed.
I interpret your question as asking how to convert each numeric cell of the data.frame into a "pretty-printed" string. That is possible with string substitution and a simple regular expression. (A good question, by the way: I do not know of any way to configure the output of numeric data to suppress leading zeros without converting the numeric data into strings.)
df2 <- data.frame(lapply(df,
function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", as.character(x)))),
stringsAsFactors = FALSE)
df2
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.2 -.12
# 3 -.02 -.05 0
# 4 0 -.03 .04
# 5 -.11 -.2 -.01
# 6 .15 .1 .2
# 7 -.26 -.3 -.22
# 8 -.23 -.28 -.17
str(df2)
# 'data.frame': 8 obs. of 3 variables:
# $ est : chr ".05" "-.16" "-.02" "0" ...
# $ low2.5: chr ".01" "-.2" "-.05" "-.03" ...
# $ up2.5 : chr ".09" "-.12" "0" ".04" ...
If you want to get a fixed number of digits after the decimal point (as shown in the expected output but not asked for explicitly) you could use sprintf or format:
df3 <- data.frame(lapply(df, function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", sprintf("%.2f", x)))), stringsAsFactors = FALSE)
df3
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.20 -.12
# 3 -.02 -.05 .00
# 4 .00 -.03 .04
# 5 -.11 -.20 -.01
# 6 .15 .10 .20
# 7 -.26 -.30 -.22
# 8 -.23 -.28 -.17
Note: This solution is not robust to locales that use a different decimal separator; it always expects a decimal point.
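The two nested gsub() calls can probably be folded into a single sub() with an optional captured minus sign. The sketch below is only one way to do it: the helper name drop_leading_zero, its digits argument, and the three-row small_df are hypothetical and not part of the original answer.

```r
# Hypothetical helper: format to a fixed number of decimals, then remove
# the zero before the decimal point while keeping an optional minus sign
drop_leading_zero <- function(x, digits = 2) {
  formatted <- sprintf(paste0("%.", digits, "f"), x)
  sub("^(-?)0\\.", "\\1.", formatted)
}

# Small illustrative subset of the question's data
small_df <- data.frame(est    = c(0.05, -0.16, 0),
                       low2.5 = c(0.01, -0.2, -0.03))

# Assigning into small_fmt[] keeps the data frame shape;
# the columns become character vectors
small_fmt <- small_df
small_fmt[] <- lapply(small_df, drop_leading_zero)
small_fmt
```

As with the answer above, the result holds strings, so it is suited to display and not to further arithmetic.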

Return value in column 1 when value in column 2 exceeds 2 for 1st time

I have a dataframe called "new_dat" containing the time (days) in column t, and temperature data (and occasionally NA) in columns A - C (please see the example in the code below):
> new_dat
t A B C
1 0.00 0.82 0.88 0.46
2 0.01 0.87 0.94 0.52
3 0.02 NA NA NA
4 0.03 0.95 1.03 0.62
5 0.04 0.98 1.06 0.67
6 0.05 1.01 1.09 0.71
7 0.06 2.00 1.13 2.00
8 0.07 1.06 1.16 0.78
9 0.08 1.07 1.18 0.81
10 0.09 1.09 1.20 0.84
11 0.10 1.10 1.21 0.86
12 0.11 2.00 1.22 0.87
Here is a dput() of the dataframe:
structure(list(t = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07,
0.08, 0.09, 0.1, 0.11), A = c(0.82, 0.870000000000001, NA,
0.949999999999999,
0.979999999999997, 1.01, 2, 1.06, 1.07, 1.09, 1.1, 2), B =
c(0.879999999999999,
0.940000000000001, NA, 1.03, 1.06, 1.09, 1.13, 1.16, 1.18, 1.2,
1.21, 1.22), C = c(0.460000000000001, 0.520000000000003, NA,
0.619999999999997, 0.669999999999998, 0.709999999999997, 2,
0.780000000000001,
0.809999999999999, 0.84, 0.859999999999999, 0.87)), .Names = c("t",
"A", "B", "C"), row.names = c(NA, 12L), class = "data.frame")
As output, I want a vector (list?) of the values of column t where the temperature reading from columns A-C >= 2 for the first time (and only the first time), or - if the temperature is never >= 2 - return the last time reading in column t (0.11 in my example). So 'A' would return the value 0.06 (and not 0.11), 'B' would have the value 0.11 and 'C' 0.06. I intended to use the vector generated to create a new dataframe something like this:
A B C
0.06 0.11 0.06
I'm inexperienced with R (and code in general) so, despite reading that looping can be inefficient (but not really understanding how to accomplish what I want without it), I tried to solve this by looping first by column and then by row as follows:
# create blank vector to add my results to
aer <- c()
# loop by column, then by row, adding values according to the if statement
for (c in 2:ncol(new_dat)){
  c <- c
  for (r in 1:nrow(new_dat)){
    r <- r
    if ((!is.na(new_dat[r,c] )) & (new_dat[r,c] >= 2)){
      aer <- c(aer, new_dat$t[r])
    }
  }
}
This returns my vector, aer, as:
> aer
[1] 0.06 0.11 0.06
So it's returning both instances where 'A' is 2, and the one from column 'C'.
I don't know how to instruct the loop to stop and move to the next column after finding one instance where my 'if' statement is true. I also tried adding an 'else' to cover the situation where the temperature doesn't exceed 2:
else {
aer <- c(aer, new_dat$t[nrow(new_dat)])
But this did not work.
I would appreciate any help in completing the code, or suggestions for a better solution.
library(tidyverse)
new_dat %>%
gather(col, temp, -t) %>% # reshape data
na.omit() %>% # remove rows with NAs
group_by(col) %>% # for each column value
summarise(v = ifelse(is.na(first(t[temp >= 2])), last(t), first(t[temp >= 2]))) %>% # return the last t value if there are no temp >=2 otherwise return the first t with temp >= 2
spread(col, v) # reshape again
# # A tibble: 1 x 3
# A B C
# <dbl> <dbl> <dbl>
# 1 0.06 0.11 0.06
This solution will create the dataframe for you automatically, instead of returning a vector for you to create the dataframe yourself.
Here is a two-step solution.
First get an index vector of the values you want, then use that index vector to subset the dataframe.
inx <- sapply(new_dat[-1], function(x) {
w <- which(x >= 2)
if(length(w)) min(w) else NROW(x)
})
new_dat[inx, 1]
#[1] 0.06 0.11 0.06
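For completeness, the loop from the question can also be made to work. The sketch below assumes the intended behaviour is "first t with a reading >= 2, otherwise the last t": it starts each column with the fallback value and uses break to leave the inner loop at the first hit. The data are the question's values rounded to two decimals as printed, and first_t and col are illustrative names.

```r
new_dat <- data.frame(
  t = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11),
  A = c(0.82, 0.87, NA, 0.95, 0.98, 1.01, 2, 1.06, 1.07, 1.09, 1.10, 2),
  B = c(0.88, 0.94, NA, 1.03, 1.06, 1.09, 1.13, 1.16, 1.18, 1.20, 1.21, 1.22),
  C = c(0.46, 0.52, NA, 0.62, 0.67, 0.71, 2, 0.78, 0.81, 0.84, 0.86, 0.87)
)

aer <- numeric(0)
for (col in 2:ncol(new_dat)) {
  # Fallback: last time reading, used if the column never reaches 2
  first_t <- new_dat$t[nrow(new_dat)]
  for (r in seq_len(nrow(new_dat))) {
    val <- new_dat[r, col]
    if (!is.na(val) && val >= 2) {
      first_t <- new_dat$t[r]
      break  # stop at the first hit and move on to the next column
    }
  }
  aer <- c(aer, first_t)
}
names(aer) <- names(new_dat)[-1]
aer
```

Setting the fallback before the inner loop avoids the need for an else branch, which would otherwise fire on every row that is below 2.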

R - using dplyr to aggregate on a continuous variable

So I have a data frame of participant data, where I have participant IDs, for each of those a bunch of target values (continuous) and predicted values.
The target value is a continuous variable, but there is a finite number of possible values, and each participant will have made a prediction for a subset of these target values.
For example, take this data frame:
data.frame(
subjectID = c(rep("p001",4),rep("p002",4),rep("p003",4)),
target = c(0.1,0.2,0.3,0.4,0.2,0.3,0.4,0.5,0.1,0.3,0.4,0.5),
pred = c(0.12, 0.23, 0.31, 0.42, 0.18, 0.32, 0.44, 0.51, 0.09, 0.33, 0.41, 0.55)
)
There are 5 possible target values: 0.1, 0.2, 0.3, 0.4 and 0.5, but each participant predicted only 4 of them. I want to get the average prediction pred for each target value target. It's further complicated by each participant having a group, and I only want to average within each group.
I tried using summarise_at but it wasn't liking the continuous data, and whilst I'm pretty experienced in coding in R, it's been a long while since I've done data summary manipulations etc.
I could do this easily in a for loop, but I want to learn to do this properly and I wasn't able to find a solution after googling for a long time.
Thanks very much
H
Just add the second grouping variable in group_by as well:
df <- data.frame(
subjectID = c(rep("p001",4),rep("p002",4),rep("p003",4)),
group = c(rep("A", 8), rep("B", 4)),
target = c(0.1,0.2,0.3,0.4,0.2,0.3,0.4,0.5,0.1,0.3,0.4,0.5),
pred = c(0.12, 0.23, 0.31, 0.42, 0.18, 0.32, 0.44, 0.51, 0.09, 0.33, 0.41, 0.55)
)
library(dplyr)
df %>%
group_by(target, group) %>%
summarise(mean(pred))
Output:
# A tibble: 9 x 3
# Groups: target [?]
target group `mean(pred)`
<dbl> <chr> <dbl>
1 0.100 A 0.120
2 0.100 B 0.0900
3 0.200 A 0.205
4 0.300 A 0.315
5 0.300 B 0.330
6 0.400 A 0.430
7 0.400 B 0.410
8 0.500 A 0.510
9 0.500 B 0.550
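If dplyr is not available, base R's aggregate() should give the same per-target, per-group means. This is only a sketch of an alternative, not part of the answer above; the result name avg is arbitrary, and the group column is the one the answer introduced.

```r
df <- data.frame(
  subjectID = c(rep("p001", 4), rep("p002", 4), rep("p003", 4)),
  group = c(rep("A", 8), rep("B", 4)),
  target = c(0.1, 0.2, 0.3, 0.4, 0.2, 0.3, 0.4, 0.5, 0.1, 0.3, 0.4, 0.5),
  pred = c(0.12, 0.23, 0.31, 0.42, 0.18, 0.32, 0.44, 0.51, 0.09, 0.33, 0.41, 0.55)
)

# Mean prediction for every combination of target and group
avg <- aggregate(pred ~ target + group, data = df, FUN = mean)
avg
```

Both approaches group on exact equality of the target values, so targets that differ by floating-point noise would need rounding first.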

Assign groups based on the trend

I have searched a lot for this simple question, but have not found a solution. It looks really simple. I have a dataframe with a column like this:
Value
0.13
0.35
0.62
0.97
0.24
0.59
0.92
0.16
0.29
0.62
0.98
All values have a range between 0 and 1. What I want is that when the value starts to drop, I assign a new group to it. Within each group, the value is increasing. So the ideal outcome will look like this:
Value Group
0.13 1
0.35 1
0.62 1
0.97 1
0.24 2
0.59 2
0.92 2
0.16 3
0.29 3
0.62 3
0.98 3
Does anyone have a suggestion for how to address this?
This should do the trick, and it uses only vectorised base functions. You may want to exchange the < for <= if that's the behaviour you want.
vec <- c(0.13, 0.35, 0.62, 0.97, 0.24, 0.59, 0.92, 0.16, 0.29, 0.62, 0.98)
cumsum(c(1, diff(vec) < 0))
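To get the Value/Group layout shown in the question, the same expression can be assigned as a new column. A minimal sketch, assuming the data frame is called dat (a made-up name) and the column is Value as in the question:

```r
dat <- data.frame(Value = c(0.13, 0.35, 0.62, 0.97, 0.24, 0.59,
                            0.92, 0.16, 0.29, 0.62, 0.98))

# diff() compares each value with the previous one; a negative difference
# marks the start of a new group, and cumsum() turns those marks into ids
dat$Group <- cumsum(c(1, diff(dat$Value) < 0))
dat
```

The leading 1 puts the first row in group 1, since diff() returns one element fewer than its input.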
This isn't the most elegant solution, but it works:
value <- c(0.13, 0.35, 0.62, 0.97, 0.24, 0.59, 0.92, 0.16, 0.29, 0.62, 0.98)
foo <- data.frame(value, group = 1)
current_group <- 1
for(i in 2:nrow(foo)){
if(foo$value[i] >= foo$value[i-1]){
foo$group[i] <- current_group
}else{
current_group <- current_group + 1
foo$group[i] <- current_group
}
}
df <- data.frame( x = c(0.13, 0.35, 0.62, 0.97, 0.24, 0.59, 0.92, 0.16, 0.29, 0.62, 0.98))
df$y <- c(NA, df$x[-nrow(df)]) # lag column (previous value)
df$chgdir <- as.numeric(df$x - df$y < 0) # 1 where the value drops below the previous one
df$chgdir[is.na(df$chgdir)] <- 0 # first row has no previous value
df$group <- cumsum(df$chgdir) + 1 # determine group number
df[,c("x", "group")]
#> x group
#> 1 0.13 1
#> 2 0.35 1
#> 3 0.62 1
#> 4 0.97 1
#> 5 0.24 2
#> 6 0.59 2
#> 7 0.92 2
#> 8 0.16 3
#> 9 0.29 3
#> 10 0.62 3
#> 11 0.98 3

How to include calculations in apply or rowsum?

I need to include some operations before summing the rows in my data frame. Here is an example:
df1 <- data.frame(
AC1Q = c(0.53, 0.57, 0.60, 0.51),
AC4Q = c(0.15, 0.12, 0.09,0.19),
AC2Q = c(0.09, 0.05, 0.07, 0.05),
AC3Q = c(0.23, 0.26, 0.23, 0.26)
)
df1
# AC1Q AC4Q AC2Q AC3Q
# 1 0.53 0.15 0.09 0.23
# 2 0.57 0.12 0.05 0.26
# 3 0.60 0.09 0.07 0.23
# 4 0.51 0.19 0.05 0.26
I want to get the row sums based on (sin(2*pi*(AC1Q-0.25)) + sin(2*pi*(-AC4Q+0.25)) - sin(2*pi*(AC2Q+0.25)) - sin(2*pi*(AC3Q-0.25)))/4. The result should be:
# 1 0.20
# 2 0.15
# 3 0.21
# 4 0.10
I am learning apply and tried apply(df1, 1, function(x) (sin(2*pi*(df1$AC1Q-0.25)) + sin(2*pi*(-df1$AC4Q+0.25)) - sin(2*pi*(-df1$AC2Q+0.25)) - sin(2*pi*(df1$AC3Q-0.25)))/4), but the result is wrong. I am not sure what I did wrong. I know I can always do the calculation for each column first, combine them into a data frame, and use rowsum. But is there a more efficient way to do it?
# Index the current row x inside the function instead of the whole df1 columns
apply(df1, 1, function(x) (sin(2*pi*(x["AC1Q"]-0.25)) +
sin(2*pi*(-x["AC4Q"]+0.25)) -
sin(2*pi*(-x["AC2Q"]+0.25)) -
sin(2*pi*(x["AC3Q"]-0.25)))/4)
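Since sin() and the arithmetic operators are vectorised in R, the same expression can probably be evaluated on whole columns without apply at all. The sketch below uses with() to refer to the columns by name and compares the result with a row-wise apply; res and by_row are illustrative names.

```r
df1 <- data.frame(
  AC1Q = c(0.53, 0.57, 0.60, 0.51),
  AC4Q = c(0.15, 0.12, 0.09, 0.19),
  AC2Q = c(0.09, 0.05, 0.07, 0.05),
  AC3Q = c(0.23, 0.26, 0.23, 0.26)
)

# Vectorised: each column is a whole vector, so one call covers every row
res <- with(df1, (sin(2 * pi * (AC1Q - 0.25)) +
                  sin(2 * pi * (-AC4Q + 0.25)) -
                  sin(2 * pi * (-AC2Q + 0.25)) -
                  sin(2 * pi * (AC3Q - 0.25))) / 4)

# Row-wise equivalent: x is one row, indexed by column name
by_row <- apply(df1, 1, function(x)
  (sin(2 * pi * (x["AC1Q"] - 0.25)) +
   sin(2 * pi * (-x["AC4Q"] + 0.25)) -
   sin(2 * pi * (-x["AC2Q"] + 0.25)) -
   sin(2 * pi * (x["AC3Q"] - 0.25))) / 4)

all.equal(res, unname(by_row))
```

The vectorised form avoids converting the data frame to a matrix row by row, which is what apply does internally.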
