Importing CSV data into OpenTSDB

I have successfully installed OpenTSDB on top of a Cloudera Hadoop/HBase cluster.
My question is: I have reams of historical 1-minute stock data that looks like this:
"Date","Time","Open","High","Low","Close","Volume"
12/30/2002,0930,24.53,24.65,24.53,24.65,762200
12/30/2002,0931,24.65,24.68,24.52,24.6,90400
.....
The Batch Imports section of the Quick Start guide in the documentation says:
./tsdb import your-file
When I try this on my data it throws an unhelpful exception.
Any hints on how to import this into OpenTSDB? Thanks.

You need to write a script to transform your CSV into something in the OpenTSDB format. The general format for OpenTSDB is: metric timestamp value tags
For instance, your sample could be written as follows:
stock.open 1041269400 24.53 symbol=XXX
stock.high 1041269400 24.65 symbol=XXX
stock.low 1041269400 24.53 symbol=XXX
stock.close 1041269400 24.65 symbol=XXX
stock.volume 1041269400 762200 symbol=XXX
stock.open 1041269460 24.65 symbol=XXX
stock.high 1041269460 24.68 symbol=XXX
stock.low 1041269460 24.52 symbol=XXX
stock.close 1041269460 24.6 symbol=XXX
stock.volume 1041269460 90400 symbol=XXX
Although, since it appears that you're working with 1-minute periods, the open/close values are largely redundant, so maybe this would be more appropriate:
stock.quote.1m 1041269340 24.53 symbol=XXX
stock.quote.1m 1041269400 24.65 symbol=XXX
stock.quote.1m 1041269460 24.6 symbol=XXX
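A small conversion script can generate this format before running ./tsdb import. Below is a minimal sketch in Python; the metric names, the symbol=XXX tag, and the use of the machine's local time zone when turning "Date"/"Time" into Unix timestamps are illustrative assumptions, not anything the question or OpenTSDB mandates.
# csv2tsdb.py -- sketch only: convert the 1-minute stock CSV into
# OpenTSDB's "metric timestamp value tags" import format.
import csv
import sys
from datetime import datetime

SYMBOL = "XXX"  # hypothetical tag value; substitute the real ticker

with open(sys.argv[1], newline="") as src:
    for row in csv.DictReader(src):
        # "12/30/2002" + "0930" -> Unix timestamp (interpreted in local time)
        ts = int(datetime.strptime(row["Date"] + " " + row["Time"],
                                   "%m/%d/%Y %H%M").timestamp())
        for field in ("Open", "High", "Low", "Close", "Volume"):
            print("stock.%s %d %s symbol=%s"
                  % (field.lower(), ts, row[field], SYMBOL))
Redirect the output to a file and feed that file to ./tsdb import; if automatic metric creation is turned off, the metrics may need to be registered first with ./tsdb mkmetric.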

I wrote a small CSV importer for OpenTSDB:
https://github.com/soeren-lubitz/csv-to-opentsdb
It works for CSV files of the form:
Timestamp,Foo,Bar
1483342774,42.1,23.2
Hope this helps. Feedback would be appreciated.

Related

Want to copy a table from the default instance database to another instance database in SQL Server 2012 using RODBC

I am able to fetch a table from the default instance database. All I need to do is copy that fetched data into a named instance database using RODBC. Any help would be appreciated. Thanks in advance.
> library("RODBC")
> odbcChannel <- odbcConnect("SasDatabase")
> odbcClose(odbcChannel)
> odbcChannel <- odbcConnect("SasDatabase")
> sqlFetch(odbcChannel, "PR0__LOG1")   # fetched data
             DateTime Temp1 Temp2 PK_identity
1 2018-08-27 09:59:00    51   151           1
2 2018-08-27 10:00:00    11    11           2
3 2018-08-27 10:01:00    71    71           3
4 2018-08-27 10:02:00    31   131           4
> odbcClose(odbcChannel)               # closing connection
I want to copy this fetched data into another instance's database.
Your question is not really clear, but it sounds like you want to upload the fetched data to a second database. If so, use RODBC or a similar package (it depends on the database type) to connect to the second database; then you can use a DBI function to upload. Some examples would be:
DBI::dbAppendTable()
or
DBI::dbSendQuery()
Any answer will need more detail about the second database instance to be more specific; a rough sketch is given below.
An excellent resource for R and databases is https://db.rstudio.com/
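For illustration only, that approach with DBI plus the odbc package might look like the sketch below; the DSN name "NamedInstanceDSN" and the target table name are placeholders, and the table is assumed to already exist on the second instance.
library(RODBC)
library(DBI)
library(odbc)

# Fetch from the default instance with RODBC, as in the question
odbcChannel <- odbcConnect("SasDatabase")
dat <- sqlFetch(odbcChannel, "PR0__LOG1")
odbcClose(odbcChannel)

# Connect to the named instance through a second DSN and append the rows
con2 <- dbConnect(odbc::odbc(), dsn = "NamedInstanceDSN")   # placeholder DSN
dbAppendTable(con2, "PR0__LOG1", dat)   # target table must already exist
dbDisconnect(con2)
DBI::dbWriteTable() could be used instead if the target table should be created on the fly.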

Loading wikidata dump

I'm loading all geographic entries (Q56061) from the Wikidata JSON dump.
The whole dump contains about 16M entries according to the Wikidata:Statistics page.
Using Python 3.4 + ijson + libyajl2, it takes about 93 hours of CPU time (AMD Phenom II X4 945 3 GHz) just to parse the file.
Using online sequential item queries for the 2.3M entries of interest takes about 134 hours.
Is there some more optimal way to perform this task?
(maybe something like the OpenStreetMap PBF format and the Osmosis tool)
My loading code and estimates were wrong.
Using ijson.backends.yajl2_cffi brings it down to about 15 hours for full parsing + filtering + storing to the database.
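For what it's worth, a minimal sketch of that kind of streaming pass might look like this; it assumes the dump is the standard single JSON array of entity objects, and that "geographic entries" means entities whose P31 (instance of) claims point at the target item, which is a guess about the original filtering logic.
# Sketch: stream the Wikidata JSON dump with the yajl2_cffi backend
import ijson.backends.yajl2_cffi as ijson

TARGET = "Q56061"  # target class taken from the question

def matching_entities(path):
    with open(path, "rb") as f:
        for entity in ijson.items(f, "item"):   # one entity per array element
            for claim in entity.get("claims", {}).get("P31", []):
                value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
                if isinstance(value, dict) and value.get("id") == TARGET:
                    yield entity
                    break

for e in matching_entities("wikidata-dump.json"):   # placeholder file name
    print(e["id"])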

How to avoid reading a whole URL into memory in R by being selective about lines

I am new to R scripting. I am trying to read some data from a URL that has a list of all the available rain gauges. Here are a few lines of the file:
ACW00011604 17.1167 -61.7833 10.1 ST JOHNS COOLIDGE FLD
ACW00011647 17.1333 -61.7833 19.2 ST JOHNS
AE000041196 25.3330 55.5170 34.0 SHARJAH INTER. AIRP GSN 41196
I am reading the file using read.fwf since it is a fixed-width file, but I don't need the whole file; I want to be selective and keep only the lines that meet some specific criteria, like starting with US. Is that possible in R? If so, would it be the preferred approach? Any comment would be appreciated.

R Quandl: couldn't connect to host

I am beginning to use Quandl to import datasets into R with the Quandl R API. It appears to be the easiest approach. However, I have a problem: the snippet of code pasted below does not work for me. It returns an error.
library(Quandl)
my_quandl_dtst <- Quandl("DOE/RBRTE")
Error in function (type, msg, asError = TRUE) : couldn't connect to host
What could be the cause of the problem?
I searched this site and found some solutions, including the one below, but it does not work for me.
set_config(use_proxy(url='your.proxy.url',port,username,password))
On the other hand, read.csv with the URL taken from the Quandl website's export-dataset facility works:
my_quandl_dtst <- read.csv('http://www.quandl.com/api/v1/datasets/DOE/RBRTE.csv?', colClasses = c('Date' = 'Date'))
I would really like to use the Quandl library, since using it would make my code cleaner. Therefore I would appreciate any help. Thanks in advance.
OK, I found the solution: I had to set RCurlOptions, because the Quandl function uses getURL() to download data from the URL, and I had to set them through the options() function. So:
options(RCurlOptions = list(proxy = "my.proxy", proxyport = my.proxyport.number))
head(quandldata <- Quandl("NSE/OIL"))
Date Open High Low Last Close Total Trade Quantity Turnover (Lacs)
1 2014-03-03 453.5 460.05 450.10 450.30 451.30 90347 410.08
2 2014-02-28 440.0 460.00 440.00 457.60 455.55 565074 2544.66
3 2014-02-26 446.2 450.95 440.00 440.65 440.60 179055 794.24
4 2014-02-25 445.1 451.75 445.10 446.60 447.20 86858 389.38
5 2014-02-24 443.0 449.50 443.00 446.50 446.30 81197 362.33
6 2014-02-21 447.9 448.65 442.95 445.50 446.80 95791 427.32
I guess you need to check whether the domain quandl.com accepts remote connections to the RBRTE.csv file.

Graphite returning incorrect datapoint

I downloaded StatsD and Graphite 0.9.x.
I used the stats client provided with the StatsD source as follows:
./statsd-client.sh 'development.com.alpha.operation.testing.rate:1|c'
I did the above operation 10 times.
Then I tried querying for a summary of the last 24 hours:
http://example.com/render?format=json&target=summarize(stats.development.com.alpha.operation.testing.rate,"24hours","sum",true)&from=-24hours&tz=UTC
I get 1 datapoint as follows:
"datapoints": [[0.0, 1386277560]]}]
Why am I getting 0.0? Even the Graphite Composer does not display anything.
I was expecting a value of 10, as I performed the operation 10 times. What did I do wrong?
storage-schemas.conf
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
Please help me understand the problem.
EDIT:
As per the answer below, I changed storage-aggregation, and I get the following output after running whisper-info on metric_file.wsp. But I am still getting 0.0 as the datapoint value, and the Graphite browser does not display anything.
maxRetention: 86400
xFilesFactor: 0.0
aggregationMethod: sum
fileSize: 17308
Archive 0
retention: 86400
secondsPerPoint: 60
points: 1440
size: 17280
offset: 28
I also looked at the stats_counts tree, as suggested in another answer, but it's the same.
What is wrong with my setup? I am using default settings for everything except the changes to storage-aggregation suggested by an answer below.
The whisper package ships with a script, whisper-info.py. Invoke it on the appropriate metric file:
./whisper-info.py /opt/graphite/storage/whisper/alpha/beta/charlie.wsp
You will get something like this:
maxRetention: 31536000
xFilesFactor: 0.0
aggregationMethod: sum
fileSize: 1261468
Archive 0
retention: 31536000
secondsPerPoint: 300
points: 105120
size: 1261440
offset: 28
Here, make sure that aggregationMethod is sum and xFilesFactor is 0.0. Most probably it is not, since this isn't Graphite's default behavior. Now write a regex that matches your metrics and put it at the beginning of the config file storage-aggregation.conf (a sketch is shown below). This will ensure that newly created metrics follow the new aggregation rule. You can read more about how xFilesFactor works in the Graphite documentation.
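As a rough illustration, such a rule could look like the snippet below; the section name and the pattern are guesses matched to the stats.development.* metrics from the question, not something prescribed by Graphite.
# hypothetical storage-aggregation.conf entry -- place it above the default rules
[stats_development_sum]
pattern = ^stats\.development\.
xFilesFactor = 0
aggregationMethod = sum
Keep in mind that this only applies to whisper files created after the change; metrics that already exist keep their old xFilesFactor and aggregation method unless the .wsp files are recreated or adjusted with the whisper utilities.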
Have you tried using the stats_counts tree instead of stats? StatsD populates both for regular counters. stats by default does some fancy averaging which can tend to make low-intensity stat signals disappear, whereas stats_counts just gives you the straight-up count, which sounds like what you want.
