Find the non zero values and frequency of those values in R - r

I have a data which has two parameters, they are data/time and flow. The flow data is intermittent flow. Lets say at times there is zero flow and suddenly the flow starts and there will be non-zero values for sometime and then the flow will be zero again. I want to understand when the non-zero values occur and how long does each non-zero flow last. I have attached the sample dataset at this location https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made plot to see the distribution but couldn't get a clue where to start to get the frequency of each non zero values. I would like to see a output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check the non-zero value first and then find how many non zero values are there continuously before it reaches zero value again. What I want to understand is the flow release durations. For eg. in one day there might be multiple releases and I want to note at what time did the release start and how long did it continue before coming to value zero. I hope this explain the problem little better.

The first point is that you have too many NA in your data. In case you want to look into it.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, zeros, non-zeros etc.. for each date.
This can be achieved with rle of course, as also mentioned by #mnel under comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though, just we'll use just g1)
# to make sure that the order of rows are not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...

I looked a a small sample of the first two days
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you jsut wante the number of flow-postive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278

Related

R: How to compare values in a column with later values in the same column

I am attempting to work with a large dataset in R where I need to create a column that compares the value in an existing column to all values that follow it (ex: row 1 needs to compare rows 1-10,000, row 2 needs to compare rows 2-10,000, row 3 needs to compare rows 3-10,000, etc.), but cannot figure out how to write the range.
I currently have a column of raw numeric values and a column of row values generated by:
samples$row = seq.int(nrow(samples))
I have attempted to generate the column with the following command:
samples$processed = min(samples$raw[samples$row:10000])
but get the error "numerical expression has 10000 elements: only the first used" and the generated column only has the value for row 1 repeated for each of the 10,000 rows.
How do I need to write this command so that the lower bound of the range is the row currently being calculated instead of 1?
Any help would be appreciated, as I have minimal programming experience.
If all you need is the min of the specific row and all following rows, then
rev(cummin(rev(samples$val)))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
If you have some other function that doesn't have a cumulative variant (and your use of min is just a placeholder), then one of:
mapply(function(a, b) min(samples$val[a:b]), seq.int(nrow(samples)), nrow(samples))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
sapply(seq.int(nrow(samples)), function(a) min(samples$val[a:nrow(samples)]))
The only reason to use mapply over sapply is if, for some reason, you want window-like operations instead of always going to the bottom of the frame. (Though if you wanted windows, I'd suggest either the zoo or slider packages.)
Data
set.seed(42)
samples <- data.frame(val = sample(1000, size=20))
samples
# val
# 1 561
# 2 997
# 3 321
# 4 153
# 5 74
# 6 228
# 7 146
# 8 634
# 9 49
# 10 128
# 11 303
# 12 24
# 13 839
# 14 356
# 15 601
# 16 165
# 17 622
# 18 532
# 19 410
# 20 882

Adding values of two columns on the same row to get a new value

Sorry for asking a very basic question but I am new to R and really stuck on a rather simple matter; I have the data frame below (2 rows and 7 columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values correspond with time duration (secs) for seven test conditions
col$names <- c(sup_b, hdt, sup_2, lbnp, sup_3, hut, sup_4)
and 17 rows (each row is for one study subject- I have only included first two rows).
I am trying to add values from row 1 col$sup_b (175) and row 1 col$hdt (434) to get the combined duration for the first two conditions i.e. 609 secs. I then add the value of the previous two cols (609) to the next col$sup_2 to get the total duration (609 + 596) and so on until the last condition col$sup_4.
I have tried the method below which is for subject 6 (row 1), which works fine, but I want to tidy this up and make it easier as I have 17 subjects (rows) and have been advised there is an easier way around this:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (hdt_lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the number of row and have the data pulled across from the data frame rather than entering each individual time period; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <-df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried: colSums(df[,c(2:3)]), but get the following warning:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.
df1[-1] <- t(apply(df1[-1],1,cumsum))
# Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
# 1 6 175 609 1205 1790 2391 2984 3195
# 2 7 130 722 1314 1907 2507 2891 3057
data
df1 <- read.table(text="Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166",h=T,strin=F)

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to workout a percentage total over a look back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This is summing column B over a range of today looking back 3 lines. It then divides this sum buy the total sum of column B + C again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look back range too:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { (x[1] / x[1] + x[2]) } )
If i could get [1] to sum the previous 3 line range and
If i could get [2] to also sum the previous 3 line range.
Still learning how to apply forward and look back periods within R. So any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rollling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834

Creating data continuously using rnorm until an outlier occurs in R

Sorry for the confusing title, but i wasn't sure how to title what i am trying to do. My objective is to create a dataset of 1000 obs each would be the length of the run. I have created a phase1 dataset, from which a set of control limits are produced. What i am trying to do now is create a phase2 dataset most likely using rnorm. what im trying to do is create a repeat loop that will continuously create values in the phase2 dataset until one of those values is outside of the control limits produced from the phase1 dataset. for example if i had 3.0 and -3.0 as control limits the phase2 dataset would create a bunch of observations until obs 398 when the value here happens to be 3.45, thus stopping the creation of data. my objective is then to record the number 398. Furthermore, I am then trying to loop the code back to the phase1 dataset/ control limits portion and create a new set of control limits and then run another phase2, until i have 1000 run lengths recorded. the code i have for the phase1/ control limits works fine and looks like this:
nphase1=50
nphase2=1000
varcount=1
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- apply(phase1, 2, mean)
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
I have previously created this code in SAS and it looks like this. might be a better reference for what i am trying to achieve then me trying to explain it.
%macro phase2_dataset (n=,varcount=, meanshift=, sigmashift=, nphase1=,simID=,);
%do z=1 %to &n;
%phase1_dataset (n=&nphase1, varcount=&varcount);
data phase2; set control_limits n=lastobs;
call streaminit(0);
do until (phase2_var1<Lower_SPC_limit_method1_var1 or
phase2_var1>Upper_SPC_limit_method1_var1);
phase2_var1 = rand("normal", &meanshift, &sigmashift);
output;
end;
run;
ods exclude all;
proc means data=phase2;
var phase2_var1;
ods output summary=x;
run;
ods select all;
data run_length; set x;
keep Phase2_var1_n;
run;
proc append base= QA.Phase2_dataset&simID data=Run_length force; run;
%end;
%mend;
Also been doing research about using a while loop in replace of the repeat loop.
Im new to R so Any ideas you are able to throw my way are greatly appreciated. Thanks!
Using a while loop indeed seems to be the way to go. Here's what I think you're looking for:
set.seed(10) #Making results reproducible
replicate(100, { #100 is easier to display here
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- colMeans(phase1) #Slightly better than apply
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
#Phase 2
x <- 0
count <- 0
while(x > Lower_SPC_Limit_Method1 && x < Upper_SPC_Limit_Method1) {
x <- rnorm(1)
count <- count + 1
}
count
})
The result is:
[1] 225 91 97 118 304 275 550 58 115 6 218 63 176 100 308 844 90 2758
[19] 161 311 1462 717 2446 74 175 91 331 210 118 1517 420 32 39 201 350 89
[37] 64 385 212 4 72 730 151 7 1159 65 36 333 97 306 531 1502 26 18
[55] 67 329 75 532 64 427 39 352 283 483 19 9 2 1018 137 160 223 98
[73] 15 182 98 41 25 1136 405 474 1025 1331 159 70 84 129 233 2 41 66
[91] 1 23 8 325 10 455 363 351 108 3
If performance becomes a problem, perhaps it would be interesting to explore some improvements, like creating more numbers with rnorm() at a time and then counting how many are necessary to exceed the limits and repeat if necessary.

finding largest consecutive region in table

I'm trying to find regions in a file that have consecutive lines based on two columns. I want to find the largest span of consecutive values. If column 4 (V3) comes immediately before the second line's value for column 3 (V2), then write the output for the longest span of consecutive values.
The input looks like this. input:
> x
grp V1 V2 V3 V4 V5 V6
1: 1 DOG.1 142 144 132 134 0
2: 2 DOG.1 313 315 303 305 0
3: 3 DOG.1 316 318 306 308 0
4: 4 DOG.1 319 321 309 311 0
5: 5 DOG.1 322 324 312 314 0
the output should look like this:
out.name in out
[1,] "DOG.1" "313" "324"
Notice how the x[1,] was removed and how the output is starting at x[2,3] and ending at x[5,4]. All of these values are consecutive.
One obvious way is to take tail(x$V2, -1L) - head(x$V3, -1L) and get the start and end indices corresponding to the maximum consecutive 1s. But I'll skip it here (and leave it to others) as I'd like to show how this can be done with the help of IRanges package:
require(data.table)
require(IRanges) ## Bioconductor package
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out.name = "DOG.1",
in = start(x.ir)[max.idx],
out = end(x.ir)[max.idx])
# out.name bla out
# 1: DOG.1 313 324

Resources