I have the following data frame df in R:
kyid industry amount
112 Apparel 345436
234 APPEARELS 234567
213 apparels 345678
345 Airlines 235678
123 IT 456789
124 IT 897685
I want to replace the incorrectly written values in industry (Apparel, APPEARELS, apparels) with Apparels.
I tried creating a vector of the misspellings and running it through a loop:
l<-c('Apparel ','APPEARELS','apparels')
for (i in range(1:3)) {
  df$industry <- gsub(pattern = l[i], "Apparels", df$industry)
}
It is not working; only one element changes.
But when I run the statement individually, it does not raise an error and it works:
df$industry <- gsub(pattern = l[1], "Apparels", df$industry)
This is a large dataset, so I need the loop to work in R. Please help.
sub without a loop, using |:
l <- c("Apparel" , "APPEARELS", "apparels")
# Using OPs data
sub(paste(l, collapse = "|"), "Apparels", df$industry)
# [1] "Apparels" "Apparels" "Apparels" "Airlines" "IT" "IT"
I'm using sub instead of gsub as there's only one occurrence of the pattern in each string (at least in the example).
While range returns a sequence in Python, it returns the minimum and maximum of a vector in R:
range(1:3)
# [1] 1 3
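This is why the loop appears to skip patterns: the body only runs for i = 1 and i = 3, so l[2] ("APPEARELS") is never used:
for (i in range(1:3)) print(l[i])
# [1] "Apparel "
# [1] "apparels"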
Instead, you could use 1:3 or seq(1,3) or seq_along(l), which all return
# [1] 1 2 3
Also note the difference between 'Apparel' and 'Apparel '.
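That trailing space means the first pattern can never match the value "Apparel" in the data:
gsub("Apparel ", "Apparels", "Apparel") # no match: returns "Apparel" unchanged
gsub("Apparel", "Apparels", "Apparel")  # returns "Apparels"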
So
df<-read.table(header=T, text="kyid industry amount
112 Apparel 345436
234 APPEARELS 234567
213 apparels 345678
345 Airlines 235678
123 IT 456789
124 IT 897685")
l<-c('Apparel','APPEARELS','apparels')
for (i in seq_along(l)) {
  df$industry <- gsub(pattern = l[i], "Apparels", df$industry)
}
df
# kyid industry amount
# 1 112 Apparels 345436
# 2 234 Apparels 234567
# 3 213 Apparels 345678
# 4 345 Airlines 235678
# 5 123 IT 456789
# 6 124 IT 897685
A VERY simplified example of my dataset:
HUC8 YEAR RO_MM
1: 10010001 1961 78.2
2: 10010001 1962 84.0
3: 10010001 1963 70.2
4: 10010001 1964 130.5
5: 10010001 1965 54.3
I found this code online which sort of, but not quite, does what I want:
# create a list of the files from your target directory
library(data.table)  # needed for fread() and rbindlist()
file_list <- list.files(path="~/Desktop/Rprojects")
# initiate a blank data frame; each iteration of the loop will append the data from the given file to this variable
allHUCS <- data.frame()
# I want to read each .csv from a folder named "Rprojects" on my desktop into one huge dataframe for further use.
for (i in 1:length(file_list)){
  temp_data <- fread(file_list[i], stringsAsFactors = F)
  allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T)
}
Question: I have read that one should not use rbindlist for a large dataset:
"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind.data.frame, file_list)." – #r2evans
I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?
allHUCS <- do.call(rbind.data.frame(allHUCS, temp_data), use.names = T)
Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two.) The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:
Error in rbindlist(list(allHUCS, temp_data), use.names = T) :
  Item 2 has 2 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
In addition: Warning messages:
1: In fread(file_list[i], stringsAsFactors = F) :
  Detected 1 column names but the data has 2 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.
2: In fread(file_list[i], stringsAsFactors = F) :
  Stopped early on line 20. Expected 2 fields but found 3. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<# mv *.csv .. ; >>
Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or how to tell where it stopped.
I'm now reading that rbindlist and rbind are two different things and that rbindlist is faster than do.call(rbind, data). But the suggestion was do.call(rbind.data.frame, file_list). Which is going to be fastest?
Since the original post does not include a reproducible example, here is one that reads data from the Pokémon Stats data that I maintain on Github.
First, we download a zip file containing one CSV file for each generation of Pokémon, and unzip it to the ./pokemonData subdirectory of the R working directory.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
"pokemonData.zip",
method="curl",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")
Next, we obtain a list of files in the directory to which we unzipped the CSV files.
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
Finally, we load the data.table package, use lapply() with data.table::fread() to read the files, combine the resulting list of data tables with do.call(), and print the head() and tail() of the resulting data frame with all 8 generations of Pokémon stats.
library(data.table)
data <- do.call(rbind,lapply(thePokemonFiles,fread))
head(data)
tail(data)
...and the output:
> head(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk Sp. Def Speed
1: 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
2: 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
3: 3 Venusaur Grass Poison 525 80 82 83 100 100 80
4: 4 Charmander Fire 309 39 52 43 60 50 65
5: 5 Charmeleon Fire 405 58 64 58 80 65 80
6: 6 Charizard Fire Flying 534 78 84 78 109 85 100
Generation
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
> tail(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk
1: 895 Regidrago Dragon 580 200 100 50 100
2: 896 Glastrier Ice 580 100 145 130 65
3: 897 Spectrier Ghost 580 100 65 60 145
4: 898 Calyrex Psychic Grass 500 100 80 80 80
5: 898 Calyrex Ice Rider Psychic Ice 680 100 165 150 85
6: 898 Calyrex Shadow Rider Psychic Ghost 680 100 85 80 165
Sp. Def Speed Generation
1: 50 80 8
2: 110 30 8
3: 80 130 8
4: 80 80 8
5: 130 50 8
6: 100 150 8
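Since the question also asks about rbindlist(): with data.table loaded it is a drop-in alternative to do.call(rbind, ...) that binds the whole list in one pass rather than copying repeatedly. A minimal sketch on the same file list (fill = TRUE guards against files whose columns differ, the situation behind the asker's error):
library(data.table)
# binds the list of data.tables in one pass; no iterative copying
data <- rbindlist(lapply(thePokemonFiles, fread), use.names = TRUE, fill = TRUE)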
I have the following column 'Checks' in my data frame 'B', which has input statements in different rows. These statements contain a variable 'abc' and, corresponding to it, a value entry as well.
The entries are manual and are not coherent from row to row. I have to extract just 'abc' followed by its value.
> B$checks
rows Checks
[1] there was no problem reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue
[2] abc(107 to 109) xyz 115 jbo xyz 104 optim
[3] problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem
[4] abc_107 xyz 116 dor problem
[5] surevy done , no approximation issues abc 103 xyz 109 crux xyz 104
[6] ping test ok abc(86 rxlevel 84
[7] field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL No Building class Residential Building Type Multi
[8] abc 89 xyz 99 so as the user has no problem , check ping test
Expected output
rows Variable Value
[1] abc 96
[2] abc 107
[3] abc 95
[4] abc 107
[5] abc 103
[6] abc 86
[7] abc 86
[8] abc 89
I tried the following, using references from similar questions.
Using str_match:
library(stringr)
m1 <- str_match(B$checks, "abc.*?([0-200.]{1,})") # value is between 0 and 200
which yielded something like the below:
row var value
1 abc-96 xyz 450 0
2 abc(10 10
3 abc 95 1 1
4 abc_10 10
5 abc 10 10
6 NA NA
7 NA NA
8 NA NA
Then I tried the following
B$Checks <- gsub("-", " ", B$Checks)
B$Checks <- gsub("/", " ", B$Checks)
B$Checks <- gsub("_", " ", B$Checks)
B$Checks <- gsub(":", " ", B$Checks)
B$Checks <- gsub(")", " ", B$Checks)
B$Checks <- gsub("((((", " ", B$Checks)
B$Checks <- gsub(".*abc", "abc", B$Checks)
B$Checks <- gsub("[[:punct:]]", " ", B$Checks)
regexp <- "[[:digit:]]+"
m <- str_extract(B$Checks, regexp)
m <- as.data.frame(m)
and was able to get the "expected output".
But now I am looking for the following:
1) A simpler set of commands, or a simpler way, to extract the expected output
2) Getting values which are represented as a range, e.g. I want the below input row
rows Checks
[2] abc(107 to 109) xyz 115 jbo xyz 104 optim
as
output >
rows Variable Value1 Value2
[2] abc 107 109
I need a solution for both 1) and 2), as I am working on larger data sets with the same patterns and lots of mixed Variable-Value combinations.
Thanks in advance.
You need to capture the digits, specifying that you want abc prior to the digits with a lookbehind:
Value <- sub(".*(?<=abc)(\\D+)?(\\d*)\\D?.*", "\\2", str, perl=TRUE)
# Value
#[1] "96" "107" "95" "107" "103" "86" "86" "89"
You can then put the values in a data.frame:
B <- data.frame(Variable="abc", Value=as.numeric(Value))
head(B, 3)
# Variable Value
#1 abc 96
#2 abc 107
#3 abc 95
data
str <- c("there was no problem reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue",
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem",
"abc_107 xyz 116 dor problem", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ",
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL No Building class Residential Building Type Multi",
"abc 89 xyz 99 so as the user has no problem , check ping test")
Using gsub() twice and magrittr for better readability:
library(magrittr)
data.frame(
  Variable = "abc",
  Value = data %>%
    gsub(".*(abc.{6}).*", "\\1", .) %>%
    gsub("[^0-9]+(\\d+).*", "\\1", .)
)
Variable Value
1 abc 96
2 abc 107
3 abc 95
4 abc 107
5 abc 103
6 abc 86
7 abc 86
8 abc 89
First we extract abc plus the next 6 characters, and then extract the first integer that appears.
data:
data <- c("there was no problem reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue",
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem",
"abc_107 xyz 116 dor problem ", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ",
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL No Building class Residential Building Type Multi",
"abc 89 xyz 99 so as the user has no problem , check ping test"
)
Using stringr for manipulating strings and rebus to write readable regex:
library(stringr)
library(rebus)
str_match(checks, pattern = capture("abc") %R% optional(or1(c(SPC, PUNCT))) %R% capture(one_or_more(DGT)))
output:
[,1] [,2] [,3]
[1,] "abc-96" "abc" "96"
[2,] "abc(107" "abc" "107"
[3,] "abc 95" "abc" "95"
[4,] "abc_107" "abc" "107"
[5,] "abc 103" "abc" "103"
[6,] "abc(86" "abc" "86"
[7,] "abc-86" "abc" "86"
[8,] "abc 89" "abc" "89"
data:
checks <- c("there was no problem reported measures abc-96 xyz 450 327bbb11869 xyz 113 aaa 4 poc 470 b 3 surveyor issue",
"abc(107 to 109) xyz 115 jio xyz 104 optim", "problemm with caller abc 95 19468 4g xyz 103 91960 1 Remarks new loc reqd is problem",
"abc_107 xyz 116 dor problem", "surevy done , no approximation issues abc 103 xyz 109 crux xyz 104 ",
"ping test ok abc(86 rxlevel 84", "field is clean , can be used to buiild the required set up abc-86 xyz 94 Digital DSL No Building class Residential Building Type Multi",
"abc 89 xyz 99 so as the user has no problem , check ping test")
I'm currently performing a multiple sequence alignment using the 'msa' package from Bioconductor. I'm using this to calculate the consensus sequence (msaConsensusSequence) and conservation score (msaConservationScore). This gives me outputs that are values ...
e.g.
ConsensusSequence:
i.llE etc (str = chr)
(lower case = 20%+ conservation, uppercase = 80%+ conservation, . = <20% conservation)
ConservationScore:
221 -296 579 71 423 etc (str = named num)
I would like to convert these into a table where the first row holds the letters of the consensus sequence, one per column, and the second row holds the corresponding conservation scores.
e.g.
i . l l E
221 -296 579 71 423
Could people please advise on the best way to go about this?
Thanks
Natalie
From what you have said in the comments, you can get a data frame like this:
library(msa)                  # Bioconductor package; also loads Biostrings
data(BLOSUM62)                # substitution matrix used for the conservation score
alignment <- msa(mySequences) # mySequences: your input sequence set
conservation <- msaConservationScore(alignment, BLOSUM62)
# Now create the data frame
df <- data.frame(consensus = names(conservation), conservation = conservation)
head(df, 11)
consensus conservation
1 T 141
2 E 160
3 E 165
4 E 325
5 ? 179
6 ? 71
7 T 216
8 W 891
9 ? 38
10 T 405
11 L 204
If you prefer to transpose it you can:
df <- t(df)
colnames(df) <- 1:ncol(df)
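One caveat with the transpose: t() on a mixed data frame returns a character matrix, so the conservation scores become strings. Using the first value from the head(df, 11) output above:
df["conservation", 1] # "141" -- a string; wrap in as.numeric() if you need numbers back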
I am working on a summer project to grab course information from my school's website.
I start off by going here: http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=
to gather the course departments.
Then I grab info from pages like this one.
I have what I need filtered down to a list like:
[1] "91091 211 01 PRINC OF FINANCIAL ACCOUNTING 3.0 55 22 33 0 MW 12:45PM 02:05PM BAB 106 Rose-Green E"
[2] "91092 211 02 PRINC OF FINANCIAL ACCOUNTING 3.0 53 18 35 0 TR 09:35AM 10:55AM BAB 123 STAFF"
[3] "91093 211 03 PRINC OF FINANCIAL ACCOUNTING 3.0 48 29 19 0 TR 05:30PM 06:50PM BAB 220 Hoskins J"
[4] "91094 212 01 MANAGEMENT ACCOUNTING 3.0 55 33 22 0 MWF 11:30AM 12:25PM BAB 106 Hoskins J"
[5] "91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R"
However my issues are as follows:
www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=CS
I need to add the department from each url. In the link I gave, the department was "CS". I need to have that included with each entry.
I need to turn this into a table, or some other object, where I can reference the data like this:
Max Wait
CRN Course Title Credit Enrl Enrl Avail List Days Start End Bldg Room Instructor
------ ---------- ------------------------------ ------ ---- ---- -------- ---- ------- ------- ------- ----- ---------- --------------------
Basically how the data is displayed on the page.
So my end goal is to go through each of those links I grab, get all the course info (except the section type), and then put it into a giant data.frame that has all the courses, like this:
Department CRN Course Title Credit MaxEnrl Enrl Avail WaitList Days Start End Bldg Room Instructor
ACC 91095 212 02 MANAGEMENT ACCOUNTING 3.0 55 27 28 0 TR 02:20PM 03:40PM BAB 106 Bryson R
So far I have this working
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=')
uah <- substring(uah[grep('fall2015', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
  dep <- readLines(url)
  dep <- dep[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', dep)]
  dep <- substring(dep, 6)
  dep <- foreach(i=dep) %do% i[grep('[[:digit:][:digit:][:digit:][:digit:]][[:digit:][:digit:][:digit:]] [[:digit:][:digit:]]', i)]
  dep <- foreach(i=dep) %do% trim(i)
  dep <- dep[2:length(dep)]
  return(dep)
}
x <- gatherClasses(uah[1])
x <-unlist(x)
I am having trouble splitting the data in the right places. I am not sure what I should try next.
EDIT (working now):
require(data.table)
require(gdata)
library(foreach)
uah <- readLines('http://www.uah.edu/cgi-bin/schedule.pl?file=sum2015b.html&segment=')
uah <- substring(uah[grep('sum2015b', uah)], 10)
uah <- sub("\\\"(.*)", "", uah)
uah <- paste("http://www.uah.edu" , uah , sep = "")
gatherClasses <- function(url){
  L <- readLines(url)
  Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
  widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
  Data <- grep("\\d{5} \\d{3}", L, value = TRUE)
  classes <- read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
  classes$department <- unlist(strsplit(url, '='))[3]
  return(classes)
}
allClasses = foreach(i=uah) %do% gatherClasses(i)
allClasses <- do.call("rbind", allClasses)
write.table(allClasses, "c:/sum2015b.txt", sep="\t")
Read the lines into L, grab the "--- ---- etc." line into Fields and ensure that there is exactly one space at the end. Find the character positions of the spaces and difference them to get the field widths. Finally grep out the data portion and read it in using read.fwf which reads fixed width fields. For example, for Art History:
URL <- "http://www.uah.edu/cgi-bin/schedule.pl?file=fall2015.html&segment=ARH"
L <- readLines(URL)
Fields <- sub(" *$", " ", grep("---", L, value = TRUE))
widths <- diff(c(0, gregexpr(" ", Fields)[[1]]))
Data <- grep("\\d{5} \\d{3} \\d{2}", L, value = TRUE)
read.fwf(textConnection(Data), widths, as.is = TRUE, strip.white = TRUE)
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
1 90628 100 01 ARH SURV:ANCIENT-MEDIEVAL 3 35 27 8 0 TR 12:45PM 02:05PM WIL 168 Joyce L
2 90630 101 01 ARH SURV:RENAISSANCE-MODERN 3 35 14 21 0 MW 12:45PM 02:05PM WIL 168 Stewart D
3 90631 101 02 ARH SURV:RENAISSANCE-MODERN 3 35 8 27 0 MW 03:55PM 05:15PM WIL 168 Stewart D
4 92269 101 03 ARH SURV:RENAISSANCE-MODERN 3 35 5 30 0 TR 11:10AM 12:30PM WIL 168 Shapiro Guanlao M
5 90632 101 04 ARH SURV:RENAISSANCE-MODERN 3 35 13 22 0 TR 02:20PM 03:40PM WIL 168 Shapiro Guanlao M
6 90633 301 01 ANCIENT GREEK ART 3 18 3 15 0 MW 02:20PM 03:40PM WIL 168 Joyce L
7 92266 306 01 COLLAPSE OF CIVILIZATIONS 3 10 4 6 0 TR 12:45PM 02:05PM SST 205 Sever T
8 W 90634 309 01 CONTEMPORARY ART & ISSUES 3 18 10 8 0 TR 09:35AM 10:55AM WIL 168 Stewart D
9 90635 320 01 ST: MODERN ARCHITECTURE 3 12 0 12 0 TR 11:10AM 12:30PM WIL 172 Takacs T
10 90636 400 01 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Joyce L
11 90637 400 02 SENIOR THESIS 3 0 0 0 0 TBA TBA TBA TBA Stewart D
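To see why the widths computation works, here is a toy illustration with a made-up separator line:
Fields <- sub(" *$", " ", "------ ---- -- ") # ensure exactly one trailing blank
gregexpr(" ", Fields)[[1]]                   # positions of the blanks: 7 12 15
diff(c(0, gregexpr(" ", Fields)[[1]]))       # field widths: 7 5 3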
I wrote and donated that schedule.pl script about 20 years ago because they simply published the flat mainframe files of all the courses on offer for each session. The script's job is to break up the whole set and present it in human-consumable chunks. (That, and back then a browser would choke on that much data.) I understand from one of the former UAH IT people that they tried to do away with it once, but got a great hue and cry from users, so they figured out how to keep it working.
It would be easier for you to ask the UAH IT folks if you can't just retrieve the underlying flat file. It used to be on a public-facing URL, but like I said, that was about 20 years ago, so I don't recall the specifics. The output you see when viewing courses is the same as the flat file, but the flat file contains every department, so you don't have to fetch each separately.
I have data with two parameters: date/time and flow. The flow data is intermittent: at times there is zero flow, then suddenly the flow starts, there are non-zero values for some time, and then the flow is zero again. I want to understand when the non-zero values occur and how long each non-zero flow lasts. I have attached the sample dataset at this location: https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made a plot to see the distribution but couldn't get a clue where to start to get the frequency of the non-zero values. I would like to see an output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check for a non-zero value first and then find how many non-zero values there are continuously before the flow reaches zero again. What I want to understand is the flow release durations. For example, in one day there might be multiple releases, and I want to note at what time each release started and how long it continued before coming back to zero. I hope this explains the problem a little better.
The first point is that you have too many NAs in your data, in case you want to look into that.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, then zeros, then non-zeros, etc., for each date.
This can be achieved with rle of course, as also mentioned by @mnel in the comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though, just we'll use just g1)
# to make sure that the order of rows are not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
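A small refinement with the same output (a sketch): compute the rle once per group instead of twice, and derive val directly:
out <- flow.dt[, {
  r <- rle(discharge == 0)
  # val is 1 for non-zero-flow runs and 0 for zero-flow runs, as above
  list(duration = r$lengths, val = as.integer(!r$values))
}, by = g1]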
I looked at a small sample of the first two days:
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you just wanted the number of flow-positive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278
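Building on that, a quick way to look at the distribution of the durations:
durs <- flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
hist(durs, main = "Durations of non-zero flow (minutes)") # distribution of run lengths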