R: check that a sequence of numbers meets a certain tolerance level

values <- c(5, 3, 2, 2.9999, 2.9998, 2.9997, 2.99996, 2.9995, 2.9994, 2.9993,
9, 2, 1.9999, 2.9999, 2.9998, 2.9997, 2.99996, 2.9995, 2.9994, 2.9993)
I have a vector of values, and I want to obtain the indices at which the difference between two consecutive numbers is below some tolerance level.
tol = 0.001
> which(abs(diff(values)) < tol)
[1] 4 5 6 7 8 9 12 14 15 16 17 18 19
I want to make sure that the difference between any two numbers meets the tolerance level for at least 5 consecutive values, so the output should look something like this (no index 12 anymore because even though the difference between 2 and 1.9999 is below tol, the difference between 1.9999 and 2.9999 is not below tol, so the 5 consecutive number rule is not met)
4 5 6 7 8 9 14 15 16 17 18 19
How can I check the difference between any two numbers is less than the tolerance level for at least 5 consecutive values?

You could use rle to check for 5 consecutive values.
which(with(rle(abs(diff(values)) < tol), rep(values & lengths >= 5, lengths)))
#[1] 4 5 6 7 8 9 14 15 16 17 18 19

You could use stats::filter to check for 5 consecutive values that meet some condition.
which(stats::filter(abs(diff(values)) < tol, filter = rep(1, 5), sides = 1) == 5) - 4
[1] 4 5 14 15
This gives the starting index of every window of 5 consecutive differences that are all within tol.
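The rle approach above can be wrapped in a small helper so that the tolerance and the minimum run length become parameters. This is only a sketch: the function name run_indices and the argument min_run are made up for illustration and do not come from either answer.

```r
# Hypothetical helper (not from the original answers): returns the indices of
# consecutive differences that stay below `tol` for at least `min_run`
# differences in a row.
run_indices <- function(x, tol, min_run = 5) {
  ok <- abs(diff(x)) < tol                  # TRUE where a difference is within tol
  r <- rle(ok)                              # lengths of the TRUE/FALSE runs
  keep <- r$values & r$lengths >= min_run   # TRUE runs that are long enough
  which(rep(keep, r$lengths))               # expand back to one flag per difference
}

values <- c(5, 3, 2, 2.9999, 2.9998, 2.9997, 2.99996, 2.9995, 2.9994, 2.9993,
            9, 2, 1.9999, 2.9999, 2.9998, 2.9997, 2.99996, 2.9995, 2.9994, 2.9993)
run_indices(values, tol = 0.001)
# [1]  4  5  6  7  8  9 14 15 16 17 18 19
```

Note that the indices refer to positions in diff(values), so index 4 is the difference between the 4th and 5th values.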

Related

WGCNA : Choosing a soft-threshold power

powers = c(c(1:10), seq(from = 12, to=20, by=2));
While going through WGCNA I came across this code, which I am not able to understand. Can anybody explain the meaning of this piece of code?
The code will create a vector of numbers stored in powers.
Specifically: 1:10 creates the numbers 1 2 3 4 5 6 7 8 9 10 (can read as 1 through 10) and seq(from = 12, to = 20, by = 2) creates a sequence of every other number from 12 to 20, i.e. 12 14 16 18 20.
powers will contain the following 15 numbers: 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
I am not familiar with the WGCNA package and do not know whether powers is an argument to a function, but this is what powers contains.
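A quick way to confirm this is to run the line on its own and inspect the result. The snippet below is just a check in base R and does not need WGCNA; the inner c() around 1:10 is dropped because it has no effect.

```r
# Candidate powers: 1 through 10, then every second number from 12 to 20
powers <- c(1:10, seq(from = 12, to = 20, by = 2))
print(powers)
# [1]  1  2  3  4  5  6  7  8  9 10 12 14 16 18 20
length(powers)
# [1] 15
```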

How do I create a column using values of a second column that meet the conditions of a third in R?

I have a dataset Comorbidity in RStudio, where I have added columns such as MDDOnset, and if the age at onset of MDD < the onset of OUD, it equals 1, and if the opposite is true, then it equals 2. I also have another column PhysDis that has values 0-100 (numeric in nature).
What I want to do is make a new column that includes the values of PhysDis, but only if MDDOnset == 1, and another if MDDOnset == 2. I want to make these columns so that I can run a t-test on them and compare the two groups (those with MDD prior to OUD, and those who had MDD after OUD) with regard to which group has the greater physical disability score. I want any case where MDDOnset is not 1 to be NA in the first column, and any case where it is not 2 to be NA in the second.
ttest1 <-t.test(Comorbidity$MDDOnset==1, Comorbidity$PhysDis)
ttest2 <-t.test(Comorbidity$MDDOnset==2, Comorbidity$PhysDis)
When I ran the t-test twice, once for MDDOnset == 1 and once for MDDOnset == 2, the mean of y (Comorbidity$PhysDis) was the same both times. Checking the original csv file, I found that this was the mean of the entire column, not just of the cases where MDDOnset was 1 or 2. If there is a way to run the t-tests on the mean of PhysDis only where MDDOnset == 1 and only where MDDOnset == 2 without making new columns, please tell me. Sorry if there are similar questions or if my approach is way off; I'm new to R and programming in general. Thanks in advance.
Here's a smaller data frame where I tried to replicate the error where the new columns have switched lengths. The issue would be that the length of C would be 4, and the length of D would be 6 if I could replicate the error.
> A <- sample(1:10)
> B <-c(25,34,14,76,56,34,23,12,89,56)
> alphabet <-data.frame(A,B)
> alphabet$C <-ifelse(alphabet$A<7, alphabet$B, NA)
> alphabet$D <-ifelse(alphabet$A>6, alphabet$B, NA)
> print(alphabet)
    A  B  C  D
1   7 25 NA 25
2   9 34 NA 34
3   4 14 14 NA
4   2 76 76 NA
5   5 56 56 NA
6  10 34 NA 34
7   8 23 NA 23
8   6 12 12 NA
9   1 89 89 NA
10  3 56 56 NA
> length(which(alphabet$C>0))
[1] 6
> length(which(alphabet$D>0))
[1] 4
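I would use the mutate function from the dplyr package, filling the non-matching rows with NA (an empty string would turn the whole column into character).
library(dplyr)
Comorbidity <- mutate(Comorbidity, newColumn = ifelse(MDDOnset == 1, PhysDis, NA), newColumn2 = ifelse(MDDOnset == 2, PhysDis, NA))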
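The question also asks for a way to compare the groups without creating new columns. One possibility, sketched here on made-up data (the values below are simulated purely for illustration and are not the asker's dataset), is to subset PhysDis by MDDOnset directly, or to use the formula interface of t.test.

```r
# Simulated stand-in for the real Comorbidity data (assumption for illustration)
set.seed(1)
Comorbidity <- data.frame(
  MDDOnset = rep(c(1, 2), each = 10),
  PhysDis  = c(rnorm(10, mean = 40, sd = 10), rnorm(10, mean = 55, sd = 10))
)

# Split PhysDis by group; no new columns needed
grp1 <- Comorbidity$PhysDis[Comorbidity$MDDOnset == 1]
grp2 <- Comorbidity$PhysDis[Comorbidity$MDDOnset == 2]
res <- t.test(grp1, grp2)   # Welch two-sample t-test by default
res$estimate                # group means, not the whole-column mean

# Same test via the formula interface; the subset() call keeps only
# rows where MDDOnset is 1 or 2, so exactly two groups remain
res2 <- t.test(PhysDis ~ MDDOnset,
               data = subset(Comorbidity, MDDOnset %in% c(1, 2)))
```

The original calls compared the logical vector Comorbidity$MDDOnset == 1 against the full PhysDis column, which is why the reported mean of y was always the mean of the entire column.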

Searching the closest value in other column

Suppose we have a data frame of two columns
X Y
10 14
12 16
14 17
15 19
21 19
The first element of Y is 14, and the nearest (or equal) value in X is 14, which is the 3rd element of X. Similarly, the next element of Y, 16, is closest to 15, which is the 4th element of X.
So, the output I would like should be
3
4
4
5
5
As my data is large, can you give me some advice on a systematic/proper way to code this?
You can try this piece of code:
apply(abs(outer(d$X,d$Y,FUN = '-')),2,which.min)
# [1] 3 4 4 5 5
Here, abs(outer(d$X,d$Y,FUN = '-')) returns a matrix of absolute differences with one row per element of d$X and one column per element of d$Y, and apply(...,2,which.min) returns the position of the minimum in each column, i.e. the index of the closest X for each Y.
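Because outer() builds a full length(X) by length(Y) matrix, it can run out of memory on large data. A possible alternative, sketched below, sorts X once and uses findInterval() to look at only the two neighbouring candidates for each Y. The function name nearest_index is hypothetical, and tie-breaking may differ from which.min when X contains duplicated values.

```r
# Hypothetical helper: for each y, the index (in the original x) of the
# closest x value, without building the full difference matrix.
nearest_index <- function(x, y) {
  ord <- order(x)
  xs <- x[ord]                           # x in ascending order
  lo <- findInterval(y, xs)              # last sorted x that is <= y (0 if none)
  lo_c <- pmax(lo, 1L)                   # candidate at or below y
  hi_c <- pmin(lo + 1L, length(xs))      # candidate above y
  pick_hi <- abs(xs[hi_c] - y) < abs(xs[lo_c] - y)
  ord[ifelse(pick_hi, hi_c, lo_c)]       # map back to positions in x
}

d <- data.frame(X = c(10, 12, 14, 15, 21), Y = c(14, 16, 17, 19, 19))
nearest_index(d$X, d$Y)
# [1] 3 4 4 5 5
```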

Filter between threshold

I am working with a large dataset and I am trying to first identify clusters of values that meet specific threshold values. My aim then is to only keep clusters of a minimum length. Below is some example data and my progress thus far:
Test = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
Sequence = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10)
Value = c(3,2,3,4,3,4,4,5,5,2,2,4,5,6,4,4,6,2,3,2)
Data <- data.frame(Test, Sequence, Value)
Using package evd, I have identified clusters of values >3
C1 <- clusters(Data$Value, u = 3, r = 1, cmax = FALSE, plot = TRUE)
Which produces
C1
$cluster1
4
4
$cluster2
6 7 8 9
4 4 5 5
$cluster3
12 13 14 15 16 17
4 5 6 4 4 6
My problem is twofold:
1) I don't know how to relate this back to the original dataframe (for example to Test A & B)
2) How can I only keep clusters with a minimum size of 3 (thus excluding Cluster 1)
I have looked into various filtering options etc. however they do not cluster data according to a desired threshold, with no options for the minimum size of the cluster either.
Any help is much appreciated.
Q1: relate back to the original dataframe: have a look at Carl Witthoft's answer to "detect intervals of the consequent integer sequences". He wrote a variant of rle() called seqle(), which looks for integer sequences rather than repetitions.
Q2: only keep clusters with a minimum size of 3:
C1[sapply(C1, length) >= 3]
yields the 2 clusters that are long enough:
$cluster2
6 7 8 9
4 4 5 5
$cluster3
12 13 14 15 16 17
4 5 6 4 4 6
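For the first part of the question, a base-R alternative that stays inside the original data frame is to flag runs of exceedances within each Test group using rle(). This is a sketch rather than a replacement for evd::clusters(): it assumes that a cluster is simply an unbroken run of values above the threshold (which matches r = 1 here), and the names keep_run and InCluster are made up for illustration.

```r
Data <- data.frame(
  Test     = rep(c("A", "B"), each = 10),
  Sequence = rep(1:10, times = 2),
  Value    = c(3, 2, 3, 4, 3, 4, 4, 5, 5, 2, 2, 4, 5, 6, 4, 4, 6, 2, 3, 2)
)

# Hypothetical helper: TRUE for values inside a run of at least `min_len`
# consecutive values above the threshold `u`
keep_run <- function(v, u = 3, min_len = 3) {
  r <- rle(v > u)
  rep(r$values & r$lengths >= min_len, r$lengths)
}

# ave() applies keep_run within each Test group; the logical result is
# stored as 0/1 because Value is numeric, hence the comparison with 1
Data$InCluster <- ave(Data$Value, Data$Test, FUN = keep_run) == 1
Data[Data$InCluster, ]   # rows 6-9 (Test A) and 12-17 (Test B)
```

Grouping by Test also prevents a run from spilling over from the end of Test A into the start of Test B.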

Count values in a data set that exceed a threshold in R

I have 2 data sets. The first data set has a vector of p-values from 0.5 to 0.001, and the corresponding threshold that meets each p-value. For example, for 0.05, the value is 13. Any value greater than 13 has a p-value of <0.05. This data set contains all the thresholds that I'm interested in. Like so:
V1 V2
1 0.500 10
2 0.200 11
3 0.100 12
4 0.050 13
5 0.010 14
6 0.001 15
The 2nd data set is just one long list of values. I need to write an R script that counts the number of values in this set that exceed each threshold. For example, count how many values in the 2nd data set exceed 13, and therefore have a p-value of <0.05, and do this for each threshold value.
Here are the first 15 values of the 2nd data set (1000 total):
1 11.100816
2 8.779858
3 10.510090
4 9.503772
5 9.392222
6 10.285920
7 8.317523
8 10.007738
9 11.021283
10 9.964725
11 9.081947
12 11.253643
13 10.896120
14 10.272814
15 10.282408
A pattern that will help you count the values meeting a condition:
length( which( data$V1 > 3 & data$V2 <0.05 ) )
Assuming dat1 and dat2 both have a V2 column, something like this:
colSums(outer(dat2$V2, setNames(dat1$V2, dat1$V2), ">"))
# 10 11 12 13 14 15
# 9 3 0 0 0 0
(reads as follows: 9 items have a value greater than 10, 3 items have a value greater than 11, etc.)
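If the second data set is very large, the same counts can be computed one threshold at a time, which avoids building the full comparison matrix. The sketch below reuses the names dat1 and dat2 from the answer above and only the 15 values shown in the question, so the counts apply to that excerpt and not to the full 1000 values; the count column is an addition for illustration.

```r
dat1 <- data.frame(V1 = c(0.5, 0.2, 0.1, 0.05, 0.01, 0.001),
                   V2 = 10:15)
dat2 <- data.frame(V2 = c(11.100816, 8.779858, 10.510090, 9.503772, 9.392222,
                          10.285920, 8.317523, 10.007738, 11.021283, 9.964725,
                          9.081947, 11.253643, 10.896120, 10.272814, 10.282408))

# For each threshold, count how many values in dat2 exceed it
dat1$count <- vapply(dat1$V2, function(th) sum(dat2$V2 > th), integer(1))
dat1
#      V1 V2 count
# 1 0.500 10     9
# 2 0.200 11     3
# 3 0.100 12     0
# 4 0.050 13     0
# 5 0.010 14     0
# 6 0.001 15     0
```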
