finding largest consecutive region in table - r

I'm trying to find regions in a file that have consecutive lines based on two columns. I want to find the largest span of consecutive values. If column 4 (V3) comes immediately before the second line's value for column 3 (V2), then write the output for the longest span of consecutive values.
The input looks like this. input:
> x
grp V1 V2 V3 V4 V5 V6
1: 1 DOG.1 142 144 132 134 0
2: 2 DOG.1 313 315 303 305 0
3: 3 DOG.1 316 318 306 308 0
4: 4 DOG.1 319 321 309 311 0
5: 5 DOG.1 322 324 312 314 0
the output should look like this:
out.name in out
[1,] "DOG.1" "313" "324"
Notice how the x[1,] was removed and how the output is starting at x[2,3] and ending at x[5,4]. All of these values are consecutive.

One obvious way is to take tail(x$V2, -1L) - head(x$V3, -1L) and get the start and end indices corresponding to the maximum consecutive 1s. But I'll skip it here (and leave it to others) as I'd like to show how this can be done with the help of IRanges package:
require(data.table)
require(IRanges) ## Bioconductor package
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out.name = "DOG.1",
in = start(x.ir)[max.idx],
out = end(x.ir)[max.idx])
# out.name bla out
# 1: DOG.1 313 324

Related

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to workout a percentage total over a look back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This is summing column B over a range of today looking back 3 lines. It then divides this sum buy the total sum of column B + C again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look back range too:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { (x[1] / x[1] + x[2]) } )
If i could get [1] to sum the previous 3 line range and
If i could get [2] to also sum the previous 3 line range.
Still learning how to apply forward and look back periods within R. So any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rollling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834

Loop for subsetting data.frame

I work with neuralnet package to predict values of stocks (diploma thesis). The example data are below
predict<-runif(23,min=0,max=1)
day<-c(369:391)
ChoosedN<-c(2,5,5,5,5,5,4,3,5,5,5,2,1,1,5,5,4,3,2,3,4,3,2)
Profit<-runif(23,min=-2,max=5)
df<-data.frame(predict,day,ChoosedN,Profit)
colnames(df)<-c('predict','day','ChoosedN','Profit')
But I haven't always same period for investments (ChoodedN). For backtest the neural site I have to skip the days when I am still in position even if the neural site says 'buy it' (i.e.predict > 0.5). The frame looks like this
predict day ChoosedN Profit
1 0.6762981061 369 2 -1.6288823350
2 0.0195611224 370 5 1.5682195597
3 0.2442795106 371 5 0.6195915225
4 0.9587601107 372 5 -1.9701975542
5 0.7415729680 373 5 3.7826137026
6 0.4814927997 374 5 4.1228808255
7 0.1340754859 375 4 3.7818792837
8 0.6316874851 376 3 0.7670884461
9 0.1107241728 377 5 -1.3367400097
10 0.5850426450 378 5 2.2848396166
11 0.2809308425 379 5 2.5234691438
12 0.2835292015 380 2 -0.3291319925
13 0.3328713216 381 1 4.7425349397
14 0.4766904986 382 1 -0.4062103292
15 0.5005860797 383 5 4.8612083721
16 0.2734292494 384 5 -0.2320077328
17 0.1488479455 385 4 2.6195679584
18 0.9446908936 386 3 0.4889716264
19 0.8222738281 387 2 0.7362413658
20 0.7570014759 388 3 4.6661250258
21 0.9988698252 389 4 2.6340743946
22 0.8384663551 390 3 1.0428046484
23 0.1938821415 391 2 0.8855748393
And I need to create new data.frame this way.For example:If predict (in first row) > 0.5,delete second and third row (because ChoosedN in first row is 2 so next two after first row has to be delete, because there we were still in position). And continue on fourth the same way (if predict (fourth row) > 0.5, delete next five rows and so. And of course, if predict <=0.5 delete this row too.
Any straightforward way how to do it with some loop?
Thanks
I would create a new dataframe, then bind the rows you want using rbind inside of a for loop
newDF <- data.frame() # New, Empty Dataframe
i = 1 # Loop index Variable
while (i < nrow(df)) {
if (df$predict[i] > 0.5) { # If predict > 0.5,
newDF <- rbind(newDF, df[i,]) # Bind the row
i = i + df$ChoosedN[i] # Adjust for ChoosedN rows
}
i = i + 1 # Move to the next row
}

printing a list based on range met

I would like to generate an string output into a list if some values are met. I have a table that looks like this:
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 1 go.1 142 144 132 134 0 31 11 F D T hy al qe 34 6 3
2: 2 go.1 313 315 303 305 0 31 11 q z t hr ye er 29 20 41
3: 3 go.1 316 318 306 308 0 31 11 f w y hu er es 64 43 19
4: 4 go.1 319 321 309 311 0 31 11 r a y ie uu qr 26 22 20
5: 5 go.1 322 324 312 314 0 31 11 g w y hp yu re 44 7 0
I'm using this function to generate a desired output:
library(IRanges); library(data.table)
rangeFinder = function(x){
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out = x[1,1],
start = start(x.ir)[max.idx],
end = end(x.ir)[max.idx])
return(ans)}
rangeFinder(x.out)
out start end
1: 1 313 324
I would also like to generate a list with the letters (from column V9-V11) in the between the start and end output from rangeFinder.
For example, the output should look like this.
out
[[go.1]]
[1] "qztfwyraygwy"
rangeFinder is looking at values in column V2 and V3 and printing the longest match of numbers. Notice how "FDT" is not included in the list output even though rangeFinder produced an output from 313-324 (and not from 142-324). How can I get the desired output?
reduce has an argument with.revmap to add a "metadata" column (accessible with mcols()) to the object. This associates with each reduced range the indexes of the original range that map to the reduced range, as an IntegerList class, basically a list where all elements are guaranteed to be integer vectors. So these are the rows you're interested in
ir <- with(x, IRanges(V2, V3))
r <- reduce(ir, with.revmap=TRUE)
i <- unlist(mcols(r)[which.max(width(r)), "revmap"])
and the data character string can be munged with something like
j <- paste0("V", 9:11)
paste0(as.matrix(x[i, j, drop=FALSE]), collapse="")
It's better to ask your questions about IRanges on the Bioconductor mailing list; no subscription required.
with.revmap is a convenience argument added relatively recently; I think
h = findOverlaps(ir, r)
i = queryHits(h)[subjectHits(h) == which.max(width(r))]
is a replacement.

Find the non zero values and frequency of those values in R

I have a data which has two parameters, they are data/time and flow. The flow data is intermittent flow. Lets say at times there is zero flow and suddenly the flow starts and there will be non-zero values for sometime and then the flow will be zero again. I want to understand when the non-zero values occur and how long does each non-zero flow last. I have attached the sample dataset at this location https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made plot to see the distribution but couldn't get a clue where to start to get the frequency of each non zero values. I would like to see a output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check the non-zero value first and then find how many non zero values are there continuously before it reaches zero value again. What I want to understand is the flow release durations. For eg. in one day there might be multiple releases and I want to note at what time did the release start and how long did it continue before coming to value zero. I hope this explain the problem little better.
The first point is that you have too many NA in your data. In case you want to look into it.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, zeros, non-zeros etc.. for each date.
This can be achieved with rle of course, as also mentioned by #mnel under comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though, just we'll use just g1)
# to make sure that the order of rows are not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
I looked a a small sample of the first two days
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you jsut wante the number of flow-postive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278

Clustering Large Data Matrix using R

I have a large data matrix (33183x1681), each row corresponding to one observation and each column corresponding to the variables.
I applied K-medoids clustering using PAM function in R, and I tried to visualize the clustering results using the built-in plots available with the PAM function. I got this error:
Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
cannot use cor=TRUE with a constant variable
I think this problem is because of the high dimensionality of the data matrix I'm trying to cluster.
Any thoughts/ideas how to tackle this issue?
Check out the clara() function in package cluster which is shipped with all versions of R.
library("cluster")
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2, samples=50)
clarax
> clarax
Call: clara(x = x, k = 2, samples = 50)
Medoids:
[,1] [,2]
[1,] -1.15913 0.5760027
[2,] 50.11584 50.3360426
Objective function: 10.23341
Clustering vector: int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Cluster sizes: 200 300
Best sample:
[1] 10 17 45 46 68 90 99 150 151 160 184 192 232 238 243 250 266 275 277
[20] 298 303 304 313 316 327 333 339 353 358 398 405 410 411 421 426 429 444 447
[39] 456 477 481 494 499 500
Available components:
[1] "sample" "medoids" "i.med" "clustering" "objective"
[6] "clusinfo" "diss" "call" "silinfo" "data"
Note that you should study the help for clara() (?clara) in some detail as well as the references cited in order to make the clustering performed by clara() as close to or identical to pam().

Resources