Getting the first exceedance date over a threshold in a sequence - r

I have a csv file with three columns. The first column is pentad dates (73 pentads in a year) while the second and third columns are for precipitation values.
What I want to do:
[1]. Get the first pentad when the precipitation exceeds the "annual mean" in "at least three consecutive pentads".
I can find the pentads that exceed the mean in the first precipitation column (RR) like this:
dat<-read.csv("test.csv",header=T,sep=",")
aa<-which(dat$RR>mean(dat$RR))
This gives me the following:
[1] 27 28 29 30 31 34 36 37 38 41 42 43 44 45 46 52 53 54 55 56 57
The correct output should be P27 in this case.
For the second precipitation column (YY), the same approach gives:
[1] 31 32 36 38 39 40 41 42 43 44 45 46 47 48 49 50 53 54 55 57 59 60 61
The correct output should be P38.
How can I add a conditional statement here taking into consideration the "three consecutive pentads"?
I don't know how to implement this in R. I'd appreciate any suggestions.
I have the following data:
Pentad RR YY
1 0 0.5771428571
2 0.0142857143 0
3 0 1.2828571429
4 0.0885714286 1.4457142857
5 0.0714285714 0.1114285714
6 0 0.36
7 0.0657142857 0
8 0.0285714286 0
9 0.0942857143 0
10 0.0114285714 1
11 0 0.0114285714
12 0 0.0085714286
13 0 0.3057142857
14 0 0
15 0 0
16 0 0
17 0.04 0
18 0 0.8
19 0.8142857143 0.0628571429
20 0.2857142857 0
21 1.14 0
22 5.3342857143 0
23 2.3514285714 0
24 1.9857142857 0.0133333333
25 1.4942857143 0.0433333333
26 2.0057142857 1.4866666667
27 20.0485714286 0
28 25.0085714286 2.4866666667
29 16.32 1.9433333333
30 11.0685714286 0.7733333333
31 8.9657142857 8.1066666667
32 3.9857142857 7.7333333333
33 5.2028571429 0.5
34 7.8028571429 4.3566666667
35 4.4514285714 2.66
36 9.22 6.6266666667
37 32.0485714286 4.4042857143
38 19.5057142857 7.9771428571
39 3.1485714286 12.9428571429
40 2.4342857143 18.4942857143
41 9.0571428571 7.3571428571
42 28.7085714286 11.0828571429
43 34.1514285714 9.0342857143
44 33.0257142857 14.2914285714
45 46.5057142857 34.6142857143
46 70.6171428571 45.3028571429
47 3.1685714286 6.66
48 1.9285714286 6.7028571429
49 7.0314285714 5.9628571429
50 0.9028571429 14.8542857143
51 5.3771428571 2.1
52 11.3571428571 2.8371428571
53 15.0457142857 7.3914285714
54 11.6628571429 32.0371428571
55 21.24 9.0057142857
56 11.4371428571 3.5257142857
57 11.6942857143 12.32
58 2.9771428571 2.32
59 4.3371428571 7.9942857143
60 0.8714285714 6.5657142857
61 1.3914285714 4.7714285714
62 0.8714285714 2.3542857143
63 1.1457142857 0.0057142857
64 2.3171428571 2.5085714286
65 0.1828571429 0.8171428571
66 0.2828571429 2.8857142857
67 0.3485714286 0.8971428571
68 0 0
69 0.3457142857 0
70 0.1428571429 0
71 0.18 0
72 4.8942857143 0.1457142857
73 0.0371428571 0.4342857143

Something like this should do it:
first_exceed_seq <- function(x, thresh = mean(x), len = 3)
{
  # Logical vector: does x exceed the threshold?
  exceed_thresh <- x > thresh
  # Indices of transition points, where exceed_thresh[i - 1] != exceed_thresh[i]
  transition <- which(diff(c(0, exceed_thresh)) != 0)
  # Reference index, grouping observations after each transition
  index <- vector("numeric", length(x))
  index[transition] <- 1
  index <- cumsum(index)
  # Break x into groups following the transitions
  exceed_list <- split(exceed_thresh, index)
  # Get the number of values exceeding the threshold in each group
  num_exceed <- vapply(exceed_list, sum, numeric(1))
  # Get the starting index of the first group where at least len values exceed thresh
  transition[as.numeric(names(which(num_exceed >= len))[1])]
}
first_exceed_seq(dat$RR)
first_exceed_seq(dat$YY)
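The same idea can also be written with run-length encoding. The sketch below is an alternative to the function above, not a replacement for it; the helper name first_run_start and the toy vector x are made up for illustration. It returns NA when no run is long enough.

```r
# Hypothetical helper: first index of a run of at least `len` consecutive
# values above `thresh`, or NA if there is no such run.
first_run_start <- function(x, thresh = mean(x), len = 3) {
  runs <- rle(x > thresh)                         # lengths and values of each run
  starts <- cumsum(c(1, head(runs$lengths, -1)))  # index where each run begins
  hit <- which(runs$values & runs$lengths >= len) # runs that exceed and are long enough
  if (length(hit) == 0) return(NA_integer_)
  as.integer(starts[hit[1]])
}

# Toy data: the only run of three or more values above 1 starts at position 7.
x <- c(0, 5, 0, 5, 5, 0, 5, 5, 5, 0)
first_run_start(x, thresh = 1)
```

Applied to the question's data this would be called as first_run_start(dat$RR) and first_run_start(dat$YY), assuming dat has been read as shown in the question.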

Related

Script out of bounds in R

I am using code based on DESeq2. One of my goals is to plot a heatmap of the data.
heatmap.data <- counts(dds)[topGenes,]
The error I am getting is
Error in counts(dds)[topGenes, ]: subscript out of bounds
The first few lines of my counts(dds) output look like this:
99h1 99h2 99h3 99h4 wth1 wth2
ENSDARG00000000002 243 196 187 117 91 96
ENSDARG00000000018 42 55 53 32 48 48
ENSDARG00000000019 91 91 108 64 95 94
ENSDARG00000000068 3 10 10 10 30 21
ENSDARG00000000069 55 47 43 53 51 30
ENSDARG00000000086 46 26 36 18 37 29
ENSDARG00000000103 301 289 289 199 347 386
ENSDARG00000000151 18 19 17 14 22 19
ENSDARG00000000161 16 17 9 19 10 20
ENSDARG00000000175 10 9 10 6 16 12
ENSDARG00000000183 12 8 15 11 8 9
ENSDARG00000000189 16 17 13 10 13 21
ENSDARG00000000212 227 208 259 234 78 69
ENSDARG00000000229 68 72 95 44 71 64
ENSDARG00000000241 71 92 67 76 88 74
ENSDARG00000000324 11 9 6 2 8 9
ENSDARG00000000370 12 5 7 8 0 5
ENSDARG00000000394 390 356 339 283 313 286
ENSDARG00000000423 0 0 2 2 7 1
ENSDARG00000000442 1 1 0 0 1 1
ENSDARG00000000472 16 8 3 5 7 8
ENSDARG00000000476 2 1 2 4 6 3
ENSDARG00000000489 221 203 169 144 84 114
ENSDARG00000000503 133 118 139 89 91 112
ENSDARG00000000529 31 25 17 26 15 24
ENSDARG00000000540 25 17 17 10 28 19
ENSDARG00000000542 15 9 9 6 15 12
How do I ensure all the elements of the top genes are present in it?
When I print the top 20 genes from the file, I get a list like this:
[1] "6339" "12416" "1241" "3025" "12791" "846" "15090"
[8] "6529" "14564" "4863" "12777" "1122" "7454" "13716"
[15] "5790" "3328" "1231" "13734" "2797" "9072"
with the column header V1.
I have used both
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = TRUE)
and
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = FALSE)
to see if the out of bounds error goes away, but neither helped. I guess the V1 header is causing the issue.
The topGenes list was generated using the following code snippet:
resordered <- res[order(res$padj),]
#Reorder gene list by increasing pAdj
resordered <- as.data.frame(res[order(res$padj),])
#Filter for genes that are differentially expressed with an FDR < 0.01
ii <- which(res$padj < 0.01)
length(ii)
# Use the rownames() function to get the top 20 differentially expressed genes from our results table
topGenes <- rownames(resordered[1:20,])
topGenes
# Get the counts from the DESeqDataSet using the counts() function
heatmap.data <- counts(dds)[topGenes,]
Perhaps this will do what you want?
counts_dds <- counts(dds)
topgenes <- c("ENSDARG00000000002", "ENSDARG00000000489", "ENSDARG00000000503",
"ENSDARG00000000540", "ENSDARG00000000529", "ENSDARG00000000542")
heatmap.data <- counts_dds[rownames(counts_dds) %in% topgenes,]
If you provide more information it will be easier to advise you on how to fix your problem.
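A quick way to see which entries cause the "subscript out of bounds" error is to compare topGenes against the row names of the count matrix. The sketch below uses a small mock matrix in place of counts(dds) and a made-up topGenes vector, so both are assumptions. Note also that read.table() returns a data frame, so the IDs read from the file would sit in a column (V1 with header = FALSE) rather than in a plain character vector, and values such as "6339" look like row numbers rather than Ensembl IDs.

```r
# Mock count matrix standing in for counts(dds); gene IDs are the row names.
counts_dds <- matrix(
  c(243, 42, 91, 196, 55, 91),
  nrow = 3,
  dimnames = list(
    c("ENSDARG00000000002", "ENSDARG00000000018", "ENSDARG00000000019"),
    c("99h1", "99h2")
  )
)

# Hypothetical topGenes: one valid ID and one entry that is not a row name.
topGenes <- c("ENSDARG00000000019", "6339")

# Entries that would trigger "subscript out of bounds" when used as row names.
missing_genes <- setdiff(topGenes, rownames(counts_dds))
present_genes <- intersect(topGenes, rownames(counts_dds))

# drop = FALSE keeps a matrix even when only one gene is left.
heatmap.data <- counts_dds[present_genes, , drop = FALSE]
```

If missing_genes is not empty, the fix belongs upstream, in how topGenes was written to or read from the file.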

Nonlinear model in r

What is the problem with the following R code? I get an error:
nonlinear <- function(G,Q,T) {
Y=G+Q*X^T
}
Model <- nls(nonlinear, start = list(G=0.4467, Q=-0.0020537, T=1), data=sample1)
Error: object of type 'closure' is not subsettable
Taking the data from your other question, "Nonlinear modelling starting values", and the code from @Roland, this works:
sample1 <- read.table(header=TRUE, text=
"X Y Z
135 -0.171292376 85
91 0.273954718 54
171 -0.288513438 107
88 -0.17363066 54
59 -1.770852012 50
1 0 37
1 0 32
1 0.301029996 36
2 -0.301029996 39
1 1.041392685 30
11 -0.087150176 42
9 0.577236408 20
34 -0.355387658 28
15 0.329058719 17
32 -0.182930683 24
21 0.196294645 21
33 0.114954516 91
43 -0.042403849 111
39 -0.290034611 88
20 -0.522878746 76
6 -0.301029995 108
3 0.477121254 78
9 0 63
9 0.492915522 51
28 -0.243038048 88
16 -0.028028724 17
15 -0.875061263 29
2 -0.301029996 44
1 0 52
1 1.531478917 65")
nonlinear<-function(X,G,Q,T) G+Q*X^T
nls(Y ~ nonlinear(X,G,Q,T), start=list(G=-0.4, Q=0.2, T=-1), data=sample1)
Depending on the data, I had to change the starting values!
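The original error most likely comes from passing the function itself as the first argument: nls() expects a formula there, with the model function called on the right-hand side. The sketch below shows the pattern on synthetic data; the data frame demo, its true parameters and the starting values are assumptions chosen so the fit is well behaved, and they are not the asker's sample1.

```r
set.seed(1)
# Synthetic data (an assumption): Y = 2 + 3 * X^0.5 plus a little noise.
demo <- data.frame(X = 1:50)
demo$Y <- 2 + 3 * demo$X^0.5 + rnorm(50, sd = 0.01)

# X is now an explicit argument instead of a free variable inside the function.
nonlinear <- function(X, G, Q, T) G + Q * X^T

# The first argument is a formula; the parameters named in `start` are estimated.
fit <- nls(Y ~ nonlinear(X, G, Q, T),
           start = list(G = 1.5, Q = 2.5, T = 0.6),
           data = demo)
coef(fit)
```

Naming a parameter T works here, but it shadows the shorthand for TRUE inside the function, so a different name may be safer in longer code.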

Calculate number of values in vector that exceed values in column of data.frame

I have a long list of numbers, e.g.
set.seed(123)
y<-round(runif(100, 0, 200))
And I would like to store in column y the number of values that exceed each value in column x of a data frame:
df <- data.frame(x=seq(0,200,20))
I can compute the numbers manually, like this:
length(which(y>=20)) #93 values exceed 20
length(which(y>=40)) #81 values exceed 40
etc. I know I can use a for-loop with all values of x, but is there a more elegant way?
I tried this:
df$y <- length(which(y>=df$x))
But this gives a warning and does not give me the desired output.
The data frame should look like this:
df
x y
1 0 100
2 20 93
3 40 81
4 60 70
5 80 61
6 100 47
7 120 40
8 140 29
9 160 19
10 180 8
11 200 0
You can compare each value of df$x against all values of y using sapply:
sapply(df$x, function(a) sum(y>a))
#[1] 99 93 81 70 61 47 40 29 18 6 0
#Looking at your output, maybe you want
sapply(df$x, function(a) sum(y>=a))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Here's another approach using outer that allows for element wise comparison of two vectors
rowSums(outer(df$x,y, "<="))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Yet one more (from alexis_laz's comment)
length(y) - findInterval(df$x, sort(y), left.open = TRUE)
# [1] 100 93 81 70 61 47 40 29 19 8 0
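For reference, the attempt in the question fails for two reasons: y >= df$x recycles the 11 thresholds along the 100 values of y (hence the warning), and length() returns a single number for the whole comparison. A minimal sketch on a small deterministic vector, used here instead of the random y so the counts can be checked by hand:

```r
y <- c(5, 10, 15, 20, 25)               # small stand-in for the random y
df <- data.frame(x = c(0, 10, 20, 30))  # thresholds

# One count per threshold; vapply() also checks that each result is a single integer.
df$y <- vapply(df$x, function(a) sum(y >= a), integer(1))
df$y  # 5 4 2 0

# The outer() and findInterval() approaches give the same counts.
via_outer    <- rowSums(outer(df$x, y, "<="))
via_interval <- length(y) - findInterval(df$x, sort(y), left.open = TRUE)
```

vapply() behaves like sapply() here but fails loudly if a result is not a single integer.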

Mean and SD in R

Maybe this is a very easy question. This is my data.frame:
> read.table("text.txt")
V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860
It means that I have 22516 sequences with length 26, 17129 sequences with length 28, etc. I would like to know the mean sequence length and its standard deviation. I know one way to do it: create a vector with 26 repeated 22516 times and so on, and then compute the mean and SD. However, I think there is an easier method. Any ideas?
Thanks.
For mean: (V1 %*% V2)/sum(V2)
For SD (population form, dividing by N): sqrt(((V1-c(V1 %*% V2)/sum(V2))**2 %*% V2)/sum(V2))
I do not find mean(rep(V1,V2)) # 61.902 and sd(rep(V1,V2)) # 14.23891 that complex, but alternatively you might try:
weighted.mean(V1,V2) # 61.902
# recipe from http://www.ltcconline.net/greenl/courses/201/descstat/meansdgrouped.htm
sqrt((sum((V1^2)*V2)-(sum(V1*V2)^2)/sum(V2))/(sum(V2)-1)) # 14.23891
Step1: Set up data:
dat.df <- read.table(text="id V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860",header=T)
Step2: Convert to data.table (only for simplicity and laziness in typing)
library(data.table)
dat <- data.table(dat.df)
Step3: Set up new columns with products, and use them to find the mean and SD
dat[,pr:=V1*V2]
dat[,v1sq:=as.numeric(V1*V1*V2)]
dat.Mean <- sum(dat$pr)/sum(dat$V2)
dat.SD <- sqrt( (sum(dat$v1sq)/sum(dat$V2)) - dat.Mean^2)
Hope this helps!!
MEAN = sum(V1*V2)/sum(V2)
SD = sqrt(sum(V1*V1*V2)/sum(V2) - MEAN^2)
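The answers above use two slightly different definitions: sd() and the grouped-data recipe divide by N - 1, while the other formulas divide by N. The sketch below compares both with the expanded vector on made-up toy data; with counts as large as in the question the two values are nearly identical.

```r
V1 <- c(2, 4, 6)   # values (hypothetical toy data)
V2 <- c(1, 2, 1)   # how many times each value occurs

n <- sum(V2)
m <- sum(V1 * V2) / n                              # weighted mean
sd_sample <- sqrt(sum(V2 * (V1 - m)^2) / (n - 1))  # matches sd(), divides by n - 1
sd_pop    <- sqrt(sum(V2 * (V1 - m)^2) / n)        # population form, divides by n

expanded <- rep(V1, V2)                            # 2 4 4 6
all.equal(m, mean(expanded))
all.equal(sd_sample, sd(expanded))
```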

In R: Indexing vectors by boolean comparison of a value in range: index==c(min : max)

In R, let's say we have a vector
area = c(rep(c(26:30), 5), rep(c(500:504), 5), rep(c(550:554), 5), rep(c(76:80), 5)) and another vector yield = c(1:100).
Now, say I want to index like so:
> yield[area==27]
[1] 2 7 12 17 22
> yield[area==501]
[1] 27 32 37 42 47
No problem, right? But weird things start happening when I try to index it by using c(A, B). (and even weirder when I try c(min:max) ...)
> yield[area==c(27,501)]
[1] 7 17 32 42
What I'm expecting is of course the instances that are present in both of the other examples, not just some weird combination of them. This works when I can use the pipe OR operator:
> yield[area==27 | area==501]
[1] 2 7 12 17 22 27 32 37 42 47
But what if I'm working with a range? Say I want to index it by the range c(27:503)? In my real example there are a lot more data points and ranges, so it makes more sense. Please don't suggest I do it by hand, which would essentially mean:
yield[area==27 | area==28 | area==29 | ... | area==303 | ... | area==502 | area==503]
There must be a better way...
You want to use %in%. Also notice that c(27:503) and 27:503 yield the same object.
> yield[area %in% 27:503]
[1] 2 3 4 5 7 8 9 10 12 13 14 15 17
[14] 18 19 20 22 23 24 25 26 27 28 29 31 32
[27] 33 34 36 37 38 39 41 42 43 44 46 47 48
[40] 49 76 77 78 79 80 81 82 83 84 85 86 87
[53] 88 89 90 91 92 93 94 95 96 97 98 99 100
Why not use subset?
subset(yield, area > 26 & area < 504) ## for indexes
subset(area, area > 26 & area < 504) ## for values
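The odd result of area==c(27,501) comes from vector recycling: == compares element by element, so the two-element vector is repeated along area, and odd positions are tested against 27 while even positions are tested against 501. A short sketch using the vectors from the question (the names recycled and in_range are introduced here for illustration):

```r
area  <- c(rep(26:30, 5), rep(500:504, 5), rep(550:554, 5), rep(76:80, 5))
yield <- 1:100

# area == c(27, 501) is evaluated as if the right side were recycled to length 100.
recycled <- rep(c(27, 501), length.out = length(area))
identical(yield[area == c(27, 501)], yield[area == recycled])  # TRUE

# For a contiguous range, two comparisons select the same elements as %in%.
in_range <- area >= 27 & area <= 503
identical(yield[in_range], yield[area %in% 27:503])            # TRUE
```

%in% tests membership instead, so every element of area is checked against the whole set on the right.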
