How to set 'index' for Bokeh TimeSeries - bokeh

Bokeh newbie here, finding that some of the learning examples are not working for me. In particular, I'm trying to use TimeSeries as in this example and set an index like so:
t = TimeSeries(dt[dt['type'] == 0][['value', 'date']], index = 'date')
show(t)
This generates the following lengthy error message (which disappears when I remove index = 'date' above):
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-93-af5f4a3c6c5e> in <module>()
2 print(dt[1:10])
3 dt.set_index('date')
----> 4 t = TimeSeries(dt[dt['type'] == 0][['value', 'date']], index = 'date')
5 show(t)
/Users/anaconda3/lib/python3.5/site-packages/bokeh/charts/builders/timeseries_builder.py in TimeSeries(data, x, y, builder_type, **kws)
100 kws['x'] = x
101 kws['y'] = y
--> 102 return create_and_build(builder_type, data, **kws)
/Users/anaconda3/lib/python3.5/site-packages/bokeh/charts/builder.py in create_and_build(builder_class, *data, **kws)
64 # create a chart to return, since there isn't one already
65 chart_kws = { k:v for k,v in kws.items() if k not in builder_props}
---> 66 chart = Chart(**chart_kws)
67 chart.add_builder(builder)
68 chart.start_plot()
/Users/anaconda3/lib/python3.5/site-packages/bokeh/charts/chart.py in __init__(self, *args, **kwargs)
112 # supported types
113 tools = kwargs.pop('tools', None)
--> 114 super(Chart, self).__init__(*args, **kwargs)
115
116 defaults.apply(self)
/Users/anaconda3/lib/python3.5/site-packages/bokeh/models/plots.py in __init__(self, **kwargs)
76 raise ValueError("Conflicting properties set on plot: background_fill, background_fill_color.")
77
---> 78 super(Plot, self).__init__(**kwargs)
79
80 def select(self, *args, **kwargs):
/Users/anaconda3/lib/python3.5/site-packages/bokeh/model.py in __init__(self, **kwargs)
81 self._id = kwargs.pop("id", make_id())
82 self._document = None
---> 83 super(Model, self).__init__(**kwargs)
84 default_theme.apply_to_model(self)
85
/Users/anaconda3/lib/python3.5/site-packages/bokeh/core/properties.py in __init__(self, **properties)
699
700 for name, value in properties.items():
--> 701 setattr(self, name, value)
702
703 def __setattr__(self, name, value):
/Users/anaconda3/lib/python3.5/site-packages/bokeh/core/properties.py in __setattr__(self, name, value)
720
721 raise AttributeError("unexpected attribute '%s' to %s, %s attributes are %s" %
--> 722 (name, self.__class__.__name__, text, nice_join(matches)))
723
724 def set_from_json(self, name, json, models=None):
AttributeError: unexpected attribute 'index' to Chart, possible attributes are above, background_fill_alpha, background_fill_color, below, border_fill_alpha, border_fill_color, disabled, extra_x_ranges, extra_y_ranges, h_symmetry, hidpi, left, legend, lod_factor, lod_interval, lod_threshold, lod_timeout, logo, min_border, min_border_bottom, min_border_left, min_border_right, min_border_top, name, outline_line_alpha, outline_line_cap, outline_line_color, outline_line_dash, outline_line_dash_offset, outline_line_join, outline_line_width, plot_height, plot_width, renderers, responsive, right, tags, title, title_standoff, title_text_align, title_text_alpha, title_text_baseline, title_text_color, title_text_font, title_text_font_size, title_text_font_style, tool_events, toolbar_location, tools, v_symmetry, webgl, x_mapper_type, x_range, xgrid, xlabel, xscale, y_mapper_type, y_range, ygrid, ylabel or yscale
What is wrong here, and how can I fix it? For reference, here are a few rows of dt:
value type date index
1 0.55 1 2016-04-12 06:00:00+00:00 1
2 0.55 1 2016-04-12 11:30:00+00:00 2
3 0.55 1 2016-04-12 06:30:00+00:00 3
4 0.55 1 2016-04-12 12:00:00+00:00 4
5 0.55 1 2016-04-12 12:30:00+00:00 5
6 0.55 1 2016-04-12 08:00:00+00:00 6
7 0.55 1 2016-04-12 08:30:00+00:00 7
8 0.55 1 2016-04-13 07:30:00+00:00 8
9 0.55 1 2016-04-12 13:00:00+00:00 9
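A hedged sketch of a likely fix, based only on the TimeSeries(data, x, y, builder_type, **kws) signature visible in the traceback: pass the date column via x= rather than index=. Note also that dt.set_index('date') on its own returns a new DataFrame rather than modifying dt in place.
from bokeh.charts import TimeSeries   # deprecated bokeh.charts API, as used in the question
from bokeh.io import show

# dt is assumed to be the DataFrame printed above, with 'value', 'type' and 'date' columns
sub = dt[dt['type'] == 0][['value', 'date']]
t = TimeSeries(sub, x='date', y='value')   # x/y keywords instead of index, per the signature above
show(t)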

Related

bs4 Attribute Error while scraping table python

I am trying to scrape a table using bs4, but whenever I iterate over the <tbody> elements, I get the following error:
Traceback (most recent call last):
  File "f:\Python Programs\COVID-19 Notifier\main.py", line 28, in <module>
    for tr in soup.find('tbody').findAll('tr'):
AttributeError: 'NoneType' object has no attribute 'findAll'
I am new to bs4 and have run into this error many times before. This is the code I am using. Any help would be greatly appreciated, as this is a project to be submitted in a competition and the deadline is near. Thanks in advance. Versions: beautifulsoup4==4.8.2, bs4==0.0.4 and soupsieve==2.0.
My code:
from plyer import notification
import requests
from bs4 import BeautifulSoup
import time

def notifyMe(title, message):
    notification.notify(
        title = title,
        message = message,
        app_icon = ".\\icon.ico",
        timeout = 6
    )

def getData(url):
    r = requests.get(url)
    return r.text

if __name__ == "__main__":
    while True:
        # notifyMe("Harry", "Lets stop the spread of this virus together")
        myHtmlData = getData('https://www.mohfw.gov.in/')
        soup = BeautifulSoup(myHtmlData, 'html.parser')
        #print(soup.prettify())
        myDataStr = ""
        for tr in soup.find('tbody').find_all('tr'):
            myDataStr += tr.get_text()
        myDataStr = myDataStr[1:]
        itemList = myDataStr.split("\n\n")
        print(itemList)
        states = ['Chandigarh', 'Telengana', 'Uttar Pradesh']
        for item in itemList[0:22]:
            dataList = item.split('\n')
            if dataList[1] in states:
                nTitle = 'Cases of Covid-19'
                nText = f"State {dataList[1]}\nIndian : {dataList[2]} & Foreign : {dataList[3]}\nCured : {dataList[4]}\nDeaths : {dataList[5]}"
                notifyMe(nTitle, nText)
                time.sleep(2)
        time.sleep(3600)
This line raises the error:
for tr in soup.findAll('tbody').findAll('tr'):
You can only call find_all on a single tag, not on a result set returned by another find_all. (findAll is the same as find_all; the latter is preferred because it follows the PEP 8 naming style.)
According to the documentation:
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
If you're looping through a single table, simply replace the first findAll with find. If there are multiple tables, store the result set in a variable and loop through it, so that find_all is applied to one tag at a time.
This should fix it:
for tr in soup.find('tbody').find_all('tr'):
Multiple tables:
tables = soup.find_all('tbody')
for table in tables:
    for tr in table.find_all('tr'):
        ...
There are a few issues here.
The <tbody> tag is within the comments of the html. BeautifulSoup skips comments, unless you specifically pull those.
Why bother with the getData() function? It's just one line, so you could simply inline it; the extra function doesn't add efficiency or readability.
Even when you pull the <tbody> tag, your dataList doesn't have 6 items (you call dataList[5], which will throw an error). I adjusted it, but I don't know if those are the correct numbers; I don't know what each of those values represents, so you may need to fix that. The headers for the data you are pulling are ['S. No.','Name of State / UT','Active Cases*','Cured/Discharged/Migrated*','Deaths**'], so I don't know what Indian : {dataList[2]} & Foreign : are supposed to be.
With that said, I don't know what those numbers represent, but is it the correct data? It looks like you can pull newer data here, but it's not the same numbers as in the <tbody>.
So, here's how to get that other data source... maybe it's more accurate?
import requests
import pandas as pd
jsonData = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(jsonData)
Output:
print(df.to_string())
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 2 Andaman and Nicobar Islands 153 5527 5309 65 146 5569 5358 65 35
1 1 Andhra Pradesh 66944 997462 922977 7541 74231 1009228 927418 7579 28
2 3 Arunachal Pradesh 380 17296 16860 56 453 17430 16921 56 12
3 4 Assam 11918 231069 217991 1160 13942 233453 218339 1172 18
4 5 Bihar 69869 365770 293945 1956 76420 378442 300012 2010 10
5 6 Chandigarh 4273 36404 31704 427 4622 37232 32180 430 04
6 7 Chhattisgarh 121555 605568 477339 6674 123479 622965 492593 6893 22
7 8 Dadra and Nagar Haveli and Daman and Diu 1668 5910 4238 4 1785 6142 4353 4 26
8 10 Delhi 91618 956348 851537 13193 92029 980679 875109 13541 07
9 11 Goa 10228 72224 61032 964 11040 73644 61628 976 30
10 12 Gujarat 92084 453836 355875 5877 100128 467640 361493 6019 24
11 13 Haryana 58597 390989 328809 3583 64057 402843 335143 3643 06
12 14 Himachal Pradesh 11859 82876 69763 1254 12246 84065 70539 1280 02
13 15 Jammu and Kashmir 16094 154407 136221 2092 16993 156344 137240 2111 01
14 16 Jharkhand 40942 184951 142294 1715 43415 190692 145499 1778 20
15 17 Karnataka 196255 1247997 1037857 13885 214330 1274959 1046554 14075 29
16 18 Kerala 156554 1322054 1160472 5028 179311 1350501 1166135 5055 32
17 19 Ladakh 2041 12937 10761 135 2034 13089 10920 135 37
18 20 Lakshadweep 803 1671 867 1 920 1805 884 1 31
19 21 Madhya Pradesh 84957 459195 369375 4863 87640 472785 380208 4937 23
20 22 Maharashtra 701614 4094840 3330747 62479 693632 4161676 3404792 63252 27
21 23 Manipur 513 30047 29153 381 590 30151 29180 381 14
22 24 Meghalaya 1133 15488 14198 157 1238 15631 14236 157 17
23 25 Mizoram 608 5220 4600 12 644 5283 4627 12 15
24 26 Nagaland 384 12800 12322 94 457 12889 12338 94 13
25 27 Odisha 32963 388479 353551 1965 36718 394694 356003 1973 21
26 28 Puducherry 5923 50580 43931 726 6330 51372 44314 728 34
27 29 Punjab 40584 319719 270946 8189 43943 326447 274240 8264 03
28 30 Rajasthan 107157 467875 357329 3389 117294 483273 362526 3453 08
29 31 Sikkim 640 6970 6193 137 693 7037 6207 137 11
30 32 Tamil Nadu 89428 1037711 934966 13317 95048 1051487 943044 13395 33
31 34 Telengana 52726 379494 324840 1928 58148 387106 326997 1961 36
32 33 Tripura 563 34302 33345 394 645 34429 33390 394 16
33 35 Uttarakhand 26980 138010 109058 1972 29949 142349 110379 2021 05
34 36 Uttar Pradesh 259810 976765 706414 10541 273653 1013370 728980 10737 09
35 37 West Bengal 68798 700904 621340 10766 74737 713780 628218 10825 19
36 11111 2428616 16263695 13648159 186920 2552940 16610481 13867997 189544
Here's your code, modified to pull the table data out of the HTML comments.
Code:
from plyer import notification   # needed by notifyMe(); missing from the snippet above
import requests
from bs4 import BeautifulSoup, Comment
import time

def notifyMe(title, message):
    notification.notify(
        title = title,
        message = message,
        app_icon = ".\\icon.ico",
        timeout = 6
    )

if __name__ == "__main__":
    while True:
        # notifyMe("Harry", "Lets stop the spread of this virus together")
        myHtmlData = requests.get('https://www.mohfw.gov.in/').text
        soup = BeautifulSoup(myHtmlData, 'html.parser')

        # the data table lives inside an HTML comment, so pull the comments first
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))
        myDataStr = ""
        for each in comments:
            if 'tbody' in str(each):
                soup = BeautifulSoup(each, 'html.parser')
                for tr in soup.find('tbody').findAll('tr'):
                    myDataStr += tr.get_text()

        myDataStr = myDataStr[1:]
        itemList = myDataStr.split("\n\n")
        print(itemList)
        states = ['Chandigarh', 'Telengana', 'Uttar Pradesh', 'Meghalaya']
        for item in itemList[0:22]:
            dataList = item.split('\n')
            if dataList[1] in states:
                nTitle = 'Cases of Covid-19'
                nText = f"State {dataList[1]}\nIndian : {dataList[0]} & Foreign : {dataList[2]}\nCured : {dataList[3]}\nDeaths : {dataList[4]}"  # <-- I changed this
                notifyMe(nTitle, nText)
                time.sleep(2)
        time.sleep(3600)

I have tried to connect a Python Jupyter notebook to an Amazon Neptune DB instance, but I got an error like this. What should I do?

This code, I got from the Amazon Neptune Tutorial https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-python.html
But I got an error like this when I run the code in a Jupyter notebook (locally on my laptop).
This is my code
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-1-fae80b27d2c6> in <module>
12 g = graph.traversal().withRemote(remoteConn)
13
---> 14 print(g.V().limit(2).toList())
15
16 remoteConn.close()
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\process\traversal.py in toList(self)
56
57 def toList(self):
---> 58 return list(iter(self))
59
60 def toSet(self):
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\process\traversal.py in __next__(self)
46 def __next__(self):
47 if self.traversers is None:
---> 48 self.traversal_strategies.apply_strategies(self)
49 if self.last_traverser is None:
50 self.last_traverser = next(self.traversers)
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\process\traversal.py in apply_strategies(self, traversal)
571 def apply_strategies(self, traversal):
572 for traversal_strategy in self.traversal_strategies:
--> 573 traversal_strategy.apply(traversal)
574
575 def apply_async_strategies(self, traversal):
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\driver\remote_connection.py in apply(self, traversal)
147 def apply(self, traversal):
148 if traversal.traversers is None:
--> 149 remote_traversal = self.remote_connection.submit(traversal.bytecode)
150 traversal.remote_results = remote_traversal
151 traversal.side_effects = remote_traversal.side_effects
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\driver\driver_remote_connection.py in submit(self, bytecode)
54
55 def submit(self, bytecode):
---> 56 result_set = self._client.submit(bytecode, request_options=self._extract_request_options(bytecode))
57 results = result_set.all().result()
58 side_effects = RemoteTraversalSideEffects(result_set.request_id, self._client,
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\driver\client.py in submit(self, message, bindings, request_options)
125
126 def submit(self, message, bindings=None, request_options=None):
--> 127 return self.submitAsync(message, bindings=bindings, request_options=request_options).result()
128
129 def submitAsync(self, message, bindings=None, request_options=None):
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\driver\client.py in submitAsync(self, message, bindings, request_options)
146 if request_options:
147 message.args.update(request_options)
--> 148 return conn.write(message)
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\driver\connection.py in write(self, request_message)
53 def write(self, request_message):
54 if not self._inited:
---> 55 self.connect()
56 request_id = str(uuid.uuid4())
57 result_set = resultset.ResultSet(queue.Queue(), request_id)
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\driver\connection.py in connect(self)
43 self._transport.close()
44 self._transport = self._transport_factory()
---> 45 self._transport.connect(self._url, self._headers)
46 self._protocol.connection_made(self._transport)
47 self._inited = True
C:\ProgramData\Anaconda3\lib\site-packages\gremlin_python\driver\tornado\transport.py in connect(self, url, headers)
38 if headers:
39 url = httpclient.HTTPRequest(url, headers=headers)
---> 40 self._ws = self._loop.run_sync(
41 lambda: websocket.websocket_connect(url, compression_options=self._compression_options))
42
~\AppData\Roaming\Python\Python38\site-packages\tornado\ioloop.py in run_sync(self, func, timeout)
456 if not future_cell[0].done():
457 raise TimeoutError('Operation timed out after %s seconds' % timeout)
--> 458 return future_cell[0].result()
459
460 def time(self):
~\AppData\Roaming\Python\Python38\site-packages\tornado\concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
~\AppData\Roaming\Python\Python38\site-packages\tornado\util.py in raise_exc_info(exc_info)
~\AppData\Roaming\Python\Python38\site-packages\tornado\stack_context.py in wrapped(*args, **kwargs)
314 if top is None:
315 try:
--> 316 ret = fn(*args, **kwargs)
317 except:
318 exc = sys.exc_info()
~\AppData\Roaming\Python\Python38\site-packages\tornado\simple_httpclient.py in _on_timeout(self, info)
305 error_message = "Timeout {0}".format(info) if info else "Timeout"
306 if self.final_callback is not None:
--> 307 raise HTTPError(599, error_message)
308
309 def _remove_timeout(self):
HTTPError: HTTP 599: Timeout while connecting
This is the error that I got.
What should I do?
Amazon Neptune runs inside a private VPC. This means you will not be able to connect to it from your laptop unless you have set up a way to reach into the VPC. This can be done by setting up an SSH tunnel through a bastion host (SSH port forwarding) or another method such as a client VPN. Here is an example of how to do this for DocumentDB, which is basically the same as for Neptune except that Neptune uses port 8182.
https://docs.aws.amazon.com/documentdb/latest/developerguide/connect-from-outside-a-vpc.html
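As a hedged sketch (placeholder host and key names, not taken from the linked guide), SSH local port forwarding through a bastion EC2 instance in the same VPC might look like this, with the Gremlin connection then pointed at the forwarded local port:
# on the laptop, open the tunnel first (placeholder names):
#   ssh -i my-key.pem -N -L 8182:my-neptune-endpoint:8182 ec2-user@my-bastion-host
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

graph = Graph()
# connect through the local end of the tunnel instead of the Neptune endpoint
remoteConn = DriverRemoteConnection('wss://localhost:8182/gremlin', 'g')
g = graph.traversal().withRemote(remoteConn)
print(g.V().limit(2).toList())
remoteConn.close()
# TLS hostname verification may need extra handling (for example a hosts-file entry
# for the Neptune endpoint), since the server certificate will not match "localhost".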

How to resolve ModuleNotFoundError in jupyter notebook?

I am not able to understand how to create the link for the package: I have installed it, but it does not work, and I cannot see why.
Kindly help me out with this project. I need to make a "Sales Automation Chatbot", and if the libraries are not working, that is a problem for me.
from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer
import spacy

Bot = ChatBot(name='PyBot', read_only=True,
              logic_adapters=['chatterbot.logic.MathematicalEvaluation',
                              'chatterbot.logic.BestMatch'])
Error: ModuleNotFoundError Traceback (most recent call last)
<ipython-input-5-298f9ccedff4> in <module>
1 Bot = ChatBot(name='PyBot', read_only=True,
2 logic_adapters=['chatterbot.logic.MathematicalEvaluation',
----> 3 'chatterbot.logic.BestMatch'])
~\Anaconda\envs\chatbot\lib\site-packages\chatterbot\chatterbot.py in __init__(self, name, **kwargs)
26 self.logic_adapters = []
27
---> 28 self.storage = utils.initialize_class(storage_adapter, **kwargs)
29
30 primary_search_algorithm = IndexedTextSearch(self, **kwargs)
~\Anaconda\envs\chatbot\lib\site-packages\chatterbot\utils.py in initialize_class(data, *args, **kwargs)
31 Class = import_module(data)
32
---> 33 return Class(*args, **kwargs)
34
35
~\Anaconda\envs\chatbot\lib\site-packages\chatterbot\storage\sql_storage.py in __init__(self, **kwargs)
18
19 def __init__(self, **kwargs):
---> 20 super().__init__(**kwargs)
21
22 from sqlalchemy import create_engine
~\Anaconda\envs\chatbot\lib\site-packages\chatterbot\storage\storage_adapter.py in __init__(self, *args, **kwargs)
19
20 self.tagger = PosLemmaTagger(language=kwargs.get(
---> 21 'tagger_language', languages.ENG
22 ))
23
~\Anaconda\envs\chatbot\lib\site-packages\chatterbot\tagging.py in __init__(self, language)
11 self.punctuation_table = str.maketrans(dict.fromkeys(string.punctuation))
12
---> 13 self.nlp = spacy.load(self.language.ISO_639_1.lower())
14
15 def get_bigram_pair_string(self, text):
~\Anaconda\envs\chatbot\lib\site-packages\spacy\__init__.py in load(name, **overrides)
28 if depr_path not in (True, False, None):
29 warnings.warn(Warnings.W001.format(path=depr_path), DeprecationWarning)
---> 30 return util.load_model(name, **overrides)
31
32
~\Anaconda\envs\chatbot\lib\site-packages\spacy\util.py in load_model(name, **overrides)
168 return load_model_from_link(name, **overrides)
169 if is_package(name): # installed as package
--> 170 return load_model_from_package(name, **overrides)
171 if Path(name).exists(): # path to model data directory
172 return load_model_from_path(Path(name), **overrides)
~\Anaconda\envs\chatbot\lib\site-packages\spacy\util.py in load_model_from_package(name, **overrides)
188 def load_model_from_package(name, **overrides):
189 """Load a model from an installed package."""
--> 190 cls = importlib.import_module(name)
191 return cls.load(**overrides)
192
~\Anaconda\envs\chatbot\lib\importlib\__init__.py in import_module(name, package)
125 break
126 level += 1
--> 127 return _bootstrap._gcd_import(name[level:], package, level)
128
129
~\Anaconda\envs\chatbot\lib\importlib\_bootstrap.py in _gcd_import(name, package, level)
~\Anaconda\envs\chatbot\lib\importlib\_bootstrap.py in _find_and_load(name, import_)
~\Anaconda\envs\chatbot\lib\importlib\_bootstrap.py in _find_and_load_unlocked(name, import_)
ModuleNotFoundError: No module named 'en'
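The traceback ends with spaCy failing to import a model package named 'en' (chatterbot passes the ISO language code to spacy.load()), so a hedged sketch of a likely fix, assuming spaCy 2.x as used by this chatterbot version, is to download the English model so it is available under that shortcut name, then restart the kernel:
# from a terminal in the same conda environment (spaCy 2.x shortcut syntax):
#   python -m spacy download en
# or equivalently, from inside the notebook:
import spacy.cli
spacy.cli.download("en")   # downloads en_core_web_sm and registers it under the name 'en'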

R split data into categories

I am trying to find the most efficient way to split a list of numbers into bins by value and then calculate a cumulative sum for each successive category.
I can't seem to get the value categories (the bin labels) onto the plot.
> scores
[1] 115 119 119 134 121 128 128 152 97 108 98 130 108 110 111 122 106 142 143 140 141 151 125 126
> table(cut(scores,breaks=10))
(96.9,102] (102,108] (108,113] (113,119] (119,124] (124,130] (130,136] (136,141] (141,147] (147,152]
2 1 4 1 4 5 1 2 2 2
> cumsum(table(cut(scores,breaks=10)))
(96.9,102] (102,108] (108,113] (113,119] (119,124] (124,130] (130,136] (136,141] (141,147] (147,152]
2 3 7 8 12 17 18 20 22 24
> plot(100*cumsum(table(cut(scores,breaks=10)))/length(scores),ylab="percent of scores")
> lines(100*cumsum(table(cut(scores,breaks=10)))/length(scores))
This produces an acceptable plot, but the x axis shows index values (2, 4, 6, ...). How can I get the bin values (96.9, 102, etc.) instead? Is there a better way to do this?
Set xaxt = "n" so that plot() does not draw the default x-axis labels, then draw them yourself with axis(), retrieving the labels with names():
plot(100*cumsum(table(cut(scores,breaks=10)))/length(scores),ylab="percent of scores", xaxt = "n")
lines(100*cumsum(table(cut(scores,breaks=10)))/length(scores))
axis(1, 1:10, names(table(cut(scores,breaks=10))))

Find the non zero values and frequency of those values in R

I have a dataset with two columns: date/time and flow. The flow is intermittent: at times there is zero flow, then suddenly the flow starts, there are non-zero values for a while, and then the flow is zero again. I want to understand when the non-zero values occur and how long each non-zero flow lasts. I have attached the sample dataset at this location https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made a plot to see the distribution, but I couldn't figure out where to start to get the duration of each run of non-zero values. I would like an output table like this:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to find the first non-zero value and then count how many non-zero values occur continuously before the flow reaches zero again. What I want to understand is the flow release durations. For example, in one day there might be multiple releases, and I want to note at what time each release started and how long it continued before coming back to zero. I hope this explains the problem a little better.
The first point is that you have a lot of NAs in your data, in case you want to look into that.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, zeros, non-zeros etc.. for each date.
This can be achieved with rle of course, as also mentioned by @mnel in the comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though we'll use just g1)
# to make sure that the order of rows is not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
                      val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
I looked at a small sample of the first two days:
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed, in which case the t() function will do it; or you could use rbind.
If you just wanted the number of flow-positive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278
