Count values in a data set that exceed a threshold in R - r

I have 2 data sets. The first data set has a vector of p-values from 0.5 - 0.001, and the corresponding threshold that meets that p-vale. For example, for 0.05, the value is 13. Any value greater than 13 has a p-value of <0.05. This data set contains all my thresholds that I'm interested in. Like so:
V1 V2
1 0.500 10
2 0.200 11
3 0.100 12
4 0.050 13
5 0.010 14
6 0.001 15
The 2nd data set is just one long list of values. I need to write an R script that counts the number of values in this set that exceed each threshold. For example, count how many values in the 2nd data set that exceed 13, and therefore have a p-value of <0.05, and do this fore each threshold value.
Here are the first 15 values of the 2nd data set (1000 total):
1 11.100816
2 8.779858
3 10.510090
4 9.503772
5 9.392222
6 10.285920
7 8.317523
8 10.007738
9 11.021283
10 9.964725
11 9.081947
12 11.253643
13 10.896120
14 10.272814
15 10.282408

Function which will help you:
length( which( data$V1 > 3 & data$V2 <0.05 ) )

Assuming dat1 and dat2 both have a V2 column, something like this:
colSums(outer(dat2$V2, setNames(dat1$V2, dat1$V2), ">"))
# 10 11 12 13 14 15
# 9 3 0 0 0 0
(reads as follows: 9 items have a value greater than 10, 3 items have a value greater than 11, etc.)

Related

R: check that a sequence of numbers meets a certain tolerance level

values <- c(5, 3, 2, 2.9999, 2.9998, 2.9997, 2.99996, 2.9995, 2.9994, 2.9993,
9, 2, 1.9999, 2.9999, 2.9998, 2.9997, 2.99996, 2.9995, 2.9994, 2.9993)
I have a string of values, and I want to obtain the indices in which the difference between any two consecutive numbers is below some tolerance level.
tol = 0.001
> which(abs(diff(values)) < tol)
[1] 4 5 6 7 8 9 12 14 15 16 17 18 19
I want to make sure that the difference between any two numbers meets the tolerance level for at least 5 consecutive values, so the output should look something like this (no index 12 anymore because even though the difference between 2 and 1.9999 is below tol, the difference between 1.9999 and 2.9999 is not below tol, so the 5 consecutive number rule is not met)
4 5 6 7 8 9 14 15 16 17 18 19
How can I check the difference between any two numbers is less than the tolerance level for at least 5 consecutive values?
You could use rle to check for 5 consecutive values.
which(with(rle(abs(diff(values)) < tol), rep(values & lengths >= 5, lengths)))
#[1] 4 5 6 7 8 9 14 15 16 17 18 19
You could use stats::filter to check for 5 consecutive values that meet some condition.
which(filter(abs(diff(values)) < tol, filter=rep(1, 5), sides=1)==5) - 4
[1] 4 5 14 15
Which give the starting positions of the indices that have 5 consecutive values whose differences are within tol.

Get column mean every block on n rows based on condition

I have this dataframe
r2 distance
1 33.64 67866
2 8.50 77229
3 15.07 109119
4 24.35 142279
5 7.74 143393
6 8.21 177670
7 12.26 216440
8 12.66 253751
9 26.31 282556
10 39.08 320816
I need to calculate the mean of column r2 for every block of rows where the distance between two values in the column distance is equal or less than 100000.
For this example the desired output would be:
mean_r2 diff_of_distance
1 17.86 75527 ## mean of rows 1 to 5; distance 5 - distance 1
2 13.91 66164 ## mean of rows 2 to 5; distance 5 - distance 2
3 13.84 68551 ## mean of rows 3 to 6; distance 6 - distance 3
4 13.14 74161 ## mean of rows 4 to 7; distance 7 - distance 4
5 9.40 73047 ## mean of rows 5 to 7; distance 7 - distance 5
6 11.04 76081 ## mean of rows 6 to 8; distance 8 - distance 6
and so on.
Edit 1: I have more than 100,000 rows.
Thanks.
Loop through each value of distance, minus this from the values in the distance vector and test if the result is less than 100000. This creates a boolean vector which you sum to identify the index at which the distance is greater than 100000 (i.e. bool becomes FALSE). Use this index to identify your block then take the mean of r2 in each block.
To speed up the code define your vector type and length (to avoid "growing vectors" on each iteration.
means <- vector("numeric", length = nrow(df))
rows <- vector("numeric", length = nrow(df))
distance_diff <- vector("numeric", length = nrow(df))
for (i in seq_along(df$distance)) {
dis_val <- df$distance[i] # the ith distance value
bools <- (df$distance - dis_val) < 100000 # bool indicating if difference between i and every value in vector is less than 100000
block_range <- sum(bools)# taking sum of bools identifies the value at which the distance becomes > 100000
rows[i] <- paste(as.character(i), "-", as.character(block_range))
means[i] <- mean(df$r2[i:block_range]) # take the mean of r2 in the range i to all rows where distance is < 100000
distance_diff[i] <- df$distance[block_range] - dis_val # minus the distance from the value before distance is > 100000 from i
}
data.frame(mean_r2 = means, rows= rows, diff_of_distance=distance_diff)
mean_r2 rows diff_of_distance
1 17.860000 1 - 5 75527
2 13.915000 2 - 5 66164
3 13.842500 3 - 6 68551
4 13.140000 4 - 7 74161
5 9.403333 5 - 7 73047
6 11.043333 6 - 8 76081
7 17.076667 7 - 9 66116
8 26.016667 8 - 10 67065
9 32.695000 9 - 10 38260
10 39.080000 10 - 10 0
You can try:
# your data
d <- read.table(text="r2 distance
1 33.64 67866
2 8.50 77229
3 15.07 109119
4 24.35 142279
5 7.74 143393
6 8.21 177670
7 12.26 216440
8 12.66 253751
9 26.31 282556
10 39.08 320816", header=T)
library(tidyverse) #dplyr_0.7.2
d %>%
mutate(index=1:n()) %>% add row index
group_by(index) %>% # group by this index
# calculate difference and find max row where diff < 100000
mutate(max_row=max(which(.$distance - distance < 100000, arr.ind=T))) %>%
# calculate mean
mutate(mean_r2=mean(.$r2[index:max_row])) %>%
# calculate the difference
mutate(diff_of_distance=.$distance[max_row] - .$distance[index]) %>%
# unite the columns
unite(rows, index, max_row, sep = "-")
# A tibble: 10 x 5
r2 distance rows mean_r2 diff_of_distance
* <dbl> <int> <chr> <dbl> <int>
1 33.64 67866 1-5 17.860000 75527
2 8.50 77229 2-5 13.915000 66164
3 15.07 109119 3-6 13.842500 68551
4 24.35 142279 4-7 13.140000 74161
5 7.74 143393 5-7 9.403333 73047
6 8.21 177670 6-8 11.043333 76081
7 12.26 216440 7-9 17.076667 66116
8 12.66 253751 8-10 26.016667 67065
9 26.31 282556 9-10 32.695000 38260
10 39.08 320816 10-10 39.080000 0
This works because group_by subsets the dataframe, thus you can access within mutate the respective distance value per group and calculate the difference with the complete vector using .$distance as this access the complete column regardless the group_by() function.

R is not ordering data correctly - skips E values

I am trying to order data by the column weightFisher. However, it is almost as if R does not process e values as low, because all the e values are skipped when I try to order from smallest to greatest.
Code:
resultTable_bon <- GenTable(GOdata_bon,
weightFisher = resultFisher_bon,
weightKS = resultKS_bon,
topNodes = 15136,
ranksOf = 'weightFisher'
)
head(resultTable_bon)
#create Fisher ordered df
indF <- order(resultTable_bon$weightFisher)
resultTable_bonF <- resultTable_bon[indF, ]
what resultTable_bon looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
1 GO:0019373 epoxygenase P450 pathway 19 13 1.12 1
2 GO:0097267 omega-hydroxylase P450 pathway 9 7 0.53 2
3 GO:0042738 exogenous drug catabolic process 10 7 0.59 3
weightFisher weightKS
1 1.9e-12 0.79744
2 7.9e-08 0.96752
3 2.5e-07 0.96336
what "ordered" resultTable_bonF looks like:
GO.ID Term Annotated Significant Expected Rank in weightFisher
17 GO:0014075 response to amine 33 7 1.95 17
18 GO:0034372 very-low-density lipoprotein particle re... 11 5 0.65 18
19 GO:0060710 chorio-allantoic fusion 6 4 0.35 19
weightFisher weightKS
17 0.00014 0.96387
18 0.00016 0.83624
19 0.00016 0.92286
As #bhas says, it appears to be working precisely as you want it to. Maybe it's the use of head() that's confusing you?
To put your mind at ease, try it with something simpler
dtf <- data.frame(a=c(1, 8, 6, 2)^-10, b=c(7, 2, 1, 6))
dtf
# a b
# 1 1.000000e+00 7
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
dtf[order(dtf$a), ]
# a b
# 2 9.313226e-10 2
# 3 1.653817e-08 1
# 4 9.765625e-04 6
# 1 1.000000e+00 7
Try the following :
resultTable_bon$weightFisher <- as.numeric (resultTable_bon$weightFisher)
Then :
resultTable_bonF <- resultTable_bon[order(resultTable_bonF$weightFisher),]

approx() without duplicates?

I am using approx() to interpolate values.
x <- 1:20
y <- c(3,8,2,6,8,2,4,7,9,9,1,3,1,9,6,2,8,7,6,2)
df <- cbind.data.frame(x,y)
> df
x y
1 1 3
2 2 8
3 3 2
4 4 6
5 5 8
6 6 2
7 7 4
8 8 7
9 9 9
10 10 9
11 11 1
12 12 3
13 13 1
14 14 9
15 15 6
16 16 2
17 17 8
18 18 7
19 19 6
20 20 2
interpolated <- approx(x=df$x, y=df$y, method="linear", n=5)
gets me this:
interpolated
$x
[1] 1.00 5.75 10.50 15.25 20.00
$y
[1] 3.0 3.5 5.0 5.0 2.0
Now, the first and last value are duplicates of my real data, is there any way to prevent this or is it something I don't understand properly about approx()?
You may want to specify xout to avoid this. For instance, if you want to always exclude the first and the last points, here's how you can do that:
specify_xout <- function(x, n) {
seq(from=min(x), to=max(x), length.out=n+2)[-c(1, n+2)]
}
plot(df$x, df$y)
points(approx(df$x, df$y, xout=specify_xout(df$x, 5)), pch = "*", col = "red")
It does not prevent from interpolating the existing point somewhere in the middle (exactly what happens on the picture below).
approx will fit through all your original datapoints if you give it a chance (change n=5 to xout=df$x to see this). Interpolation is the process of generating values for y given unobserved values of x, but should agree if the values of x have been previously observed.
The method="linear" setup is going to 'draw' linear segments joining up your original coordinates exactly (and so will give the y values you input to it for integer x). You only observe 'new' y values because your n=5 means that for points other than the beginning and end the x is not an integer (and therefore not one of your input values), and so gets interpolated.
If you want observed values not to be exactly reproduced, then maybe add some noise via rnorm ?

Maximum Intermediate Volatility

I have two vectors, a and b. See attached.
a is the signal and is a probability.
b is the absolute percentage change the next period.
Signalt <- seq(0, 1, 0.05)
I would like to find the maximum absolute return occuring within each intermediate 5%-tile (Signalt) of the a vector. So if it is
0.01, 0.02, 0.03, 0.06 0.07
then it should calculate the maximum return between
0.01 and 0.02,
0.01 and 0.03,
0.02 and 0.03.
Then move on to
0.06 and 0.07 do it over etc.
Output would then be combined in a matrix or table when the entire sequence has run.
It should follow the index from vector a and b.
i is an index that is updated by one every time that a crosses into a new percentile. t(i) is the bucket associated with the ith cross.
a is the probability vector which has length tao. This vector should be analyzed in its 5% tiles, with the maximum intermediate absolute return being the output. The price change of next period is the vector b. This would be represented by P in the equation below.
l and m are indexes.
Every time Signal moves from one 5% tile to another, we compute the
largest absolute return that occurs between any two intermediate
buckets, until Signal moves to another 5% tile. For example, suppose
that Signal moves into the 85th percentile and 4 volume buckets later
moves into the 90th percentile. We would then calculate absolute
returns between buckets 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3
and 4. We are interested in the maximum absolute return. We would then
calculate the max return in the following percentile bucket, move on
to the next, which could be an 85th percentile and so on. So we let i
be an index that is updated by 1 every time that Signal moves from one
percentile into another, and τ(i) the bucket associated with the ith
cross.
This is the equation I am using. The notation might vary slightly.
Now my question is how to go about this. Perhaps someone has an intuitive solution to this.
I hope my question is clear.
"a","b"
0,0.013013698630137
0,0.0013522650439487
0,0.00135409614082593
0,0.00203389830508471
0.27804813511593,0.00135317997293627
0.300237801284318,0
0.495965075167796,0.00405405405405412
0.523741892051237,0.000672947510094168
0.558753750296458,0.00202020202020203
0.665762829019002,0.000672043010752743
0.493106479913899,0.000671591672263272
0.344592579573497,0.000672043010752854
0.336263897823707,0.00201748486886366
0.35884763774257,0.00536912751677865
0.23662807979007,0.00133511348464632
0.212636893966841,0.00267379679144386
0.362212830513403,0.000666666666666593
0.319216408413927,0.00333555703802535
0.277670854167344,0
0.310143323100971,0
0.374104373036218,0.00267737617135211
0.190943075221511,0.00268456375838921
0.165770070508112,0.00200803212851386
0.240310208616952,0.00133600534402145
0.212418038918236,0.00200133422281523
0.204282022136019,0.00200534759358306
0.363725074298064,0.000667111407605114
0.451807761954326,0.000666666666666593
0.369296011692801,0.000666222518321047
0.37503495989363,0.0026666666666666
0.323386355686901,0.00132978723404265
0.189216171830472,0.00266311584553924
0.185252052821193,0.00199203187250996
0.174882909380997,0.000662690523525522
0.149291525540782,0.00132625994694946
0.196824215268048,0.00264900662251666
0.164611993131396,0.000660501981505912
0.125470998266484,0.00132187706543285
0.179999532586703,0.00264026402640272
0.368749638521621,0.000658327847267826
0.427799340926225,0
My interpretation of the question
I hope I understand your question correctly. Here is what I understood:
For each row you compute which 5% percentile it belongs to
Whenever that percentile changes, you start a new bucket
All rows from the same bucket result in a single resulting value
If there is only a single row in a bucket, the b value from that row is the resulting value
Otherwise, you compute all abs(b[l]/b[m]-1) where m<l and both belong to the same bucket
Basic answer
Code
This code here does what I describe above:
# read the data (shortened, full data in OP)
d <- read.table(textConnection("a,b
0,0.013013698630137
[…]
0.427799340926225,0
"), sep=",", header=TRUE)
# compute percentile number for each line
d$percentile <- floor(d$a/0.05)*5 + 5
# start a new bucket whenever the percentile changes
d$bucket <- cumsum(c(1, diff(d$percentile) != 0))
# compute a single number for all rows of the same bucket
aggregate(b ~ percentile + bucket, d, function(b) {
if(length(b) == 1) return(b); # special case of only a single row
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1)) # compare all pairs
return(max(m[upper.tri(m)])) # only return pairs with m < l
})
Output
The result will look like this:
percentile bucket b
1 5 1 0.8960891071
2 30 2 0.0013531800
3 35 3 0.0000000000
4 50 4 0.0040540541
5 55 5 0.0006729475
6 60 6 0.0020202020
7 70 7 0.0006720430
8 50 8 0.0006715917
9 35 9 2.0020174849
10 40 10 0.0053691275
11 25 11 1.0026737968
12 40 12 0.0006666667
13 35 13 0.0033355570
14 30 14 0.0000000000
15 35 15 0.0000000000
16 40 16 0.0026773762
17 20 17 0.2520080321
18 25 18 0.5010026738
19 40 19 0.0006671114
20 50 20 0.0006666667
21 40 21 3.0026666667
22 35 22 0.0013297872
23 20 23 0.7511597084
24 15 24 0.0013262599
25 20 25 0.7506605020
26 15 26 0.0013218771
27 20 27 0.0026402640
28 40 28 0.0006583278
29 45 29 0.0000000000
Additional columns
Code
If you also want to know the number of items in each group, then I suggest you use the plyr library:
library(plyr)
aggB <- function(b) {
if(length(b) == 1) return(b)
m <- outer(b, b, function(pm, pl) abs(pl/pm - 1))
return(max(m[upper.tri(m)]))
}
ddply(d, .(bucket), summarise,
percentile = percentile[1], n = length(b), maxr = aggB(b))
Output
This will give you the following result:
bucket percentile n maxr
1 1 5 4 0.8960891071
2 2 30 1 0.0013531800
3 3 35 1 0.0000000000
4 4 50 1 0.0040540541
5 5 55 1 0.0006729475
6 6 60 1 0.0020202020
7 7 70 1 0.0006720430
8 8 50 1 0.0006715917
9 9 35 2 2.0020174849
10 10 40 1 0.0053691275
11 11 25 2 1.0026737968
12 12 40 1 0.0006666667
13 13 35 1 0.0033355570
14 14 30 1 0.0000000000
15 15 35 1 0.0000000000
16 16 40 1 0.0026773762
17 17 20 2 0.2520080321
18 18 25 3 0.5010026738
19 19 40 1 0.0006671114
20 20 50 1 0.0006666667
21 21 40 2 3.0026666667
22 22 35 1 0.0013297872
23 23 20 3 0.7511597084
24 24 15 1 0.0013262599
25 25 20 2 0.7506605020
26 26 15 1 0.0013218771
27 27 20 1 0.0026402640
28 28 40 1 0.0006583278
29 29 45 1 0.0000000000
I am not sure to understand but here an attempt. My idea is to group data by centiles than do calculation on each group using by
To group data I create a new variable split
##dat$split <- cut(dat$a,seq(0, 1, 0.05),include.lowest=T)
dat$split <- c(0,cumsum(diff(dat$a) > 0.05))
Using by, I can performs my function en each group. I remove the singular cases of NULL prob values or one values.
by(dat,dat$split,FUN =function(x){
P <- x$b
if( is.null(P)||length(P) ==1) return(0)
nn <- length(P)
ind <- expand.grid(1:nn,1:nn) ## I generate indexes here
ret <- abs(P[ind[,1]]/P[ind[,2]]-1) ## perfom P_l/P_m-1 (vectorized)
list(P=P,
ret.max = max(ret),
ret.ind = ind[which.max(ret),])
})
Here the result list. For each interval I show ,
P ( Prob values),
The maximum return
The indexes from which this maximum is computed.
For example:
dat$split: 0
$P
[1] 0.0130 0.0014 0.0014 0.0020
$ret.max
[1] 8.6236
$ret.ind
Var1 Var2
5 1 2
---------------------------------------------------------------------------------------------------------------
dat$split: 1
$P
[1] 0.0014 0.0000
$ret.max
[1] 1
$ret.ind
Var1 Var2
2 2 1

Resources