OpenTSDB: Java API for batch insert and bulk upload from file

Is there a Java API in OpenTSDB for the following tasks:
1) Batch insert multiple metrics (with multiple datapoints for each metric).
2) Bulk import from a file.
I get the data in a CSV file as follows :
timestamp,tag,metric1,metric2,metric3,metric4,metric5
1315000846,Test_01,62.5,82.5,52.5,10.5,85.5
1315000850,Test_02,52.5,72.5,42.5,5.5,75.5
The time-series data for the above two lines will be as follows :
metric1 1315000846 62.5 tag=Test_01
metric2 1315000846 82.5 tag=Test_01
metric3 1315000846 52.5 tag=Test_01
metric4 1315000846 10.5 tag=Test_01
metric5 1315000846 85.5 tag=Test_01
metric1 1315000850 52.5 tag=Test_02
metric2 1315000850 72.5 tag=Test_02
metric3 1315000850 42.5 tag=Test_02
metric4 1315000850 5.5 tag=Test_02
metric5 1315000850 75.5 tag=Test_02
I am thinking of two ways:
1) Batch insert the above datapoints using some API (if available).
2) Save the above content in a new file and bulk upload that file using some API (if available).
I have gone through WritableDataPoints, using which we can add multiple datapoints.
But I am not sure if we can add multiple metrics using the same instance (setSeries() takes only a single metric name).

I ended up using WritableDataPoints.
I had a look at the TextImporter source code and found that it maintains a map of WritableDataPoints keyed by metric + tags, and reuses the same WritableDataPoints object whenever it adds a new data point for a metric with the same tags.
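The idea of one buffer per metric + tags combination can be sketched without any OpenTSDB classes. This is only an illustration under assumptions, not TextImporter's actual code: SeriesGrouper, Point and group are hypothetical names, the key format is made up, and a plain list stands in for each WritableDataPoints instance. It pivots the CSV rows from the question into one bucket per series.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenTSDB: pivots CSV rows into one
// bucket of points per series (metric + tag). Each bucket stands in for
// the single WritableDataPoints instance that would be reused for that series.
final class SeriesGrouper {

    // One data point: Unix timestamp in seconds plus a value.
    static final class Point {
        final long timestamp;
        final double value;

        Point(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // csvLines: header row first ("timestamp,tag,metric1,..."), then data rows.
    // Returns series key -> points, in first-seen order.
    static Map<String, List<Point>> group(List<String> csvLines) {
        Map<String, List<Point>> series = new LinkedHashMap<>();
        if (csvLines.isEmpty()) {
            return series;
        }
        String[] header = csvLines.get(0).split(",");
        for (String line : csvLines.subList(1, csvLines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cols = line.split(",");
            long timestamp = Long.parseLong(cols[0].trim());
            String tag = cols[1].trim();
            // Columns 2 and up hold one value per metric named in the header.
            for (int i = 2; i < header.length && i < cols.length; i++) {
                String key = header[i].trim() + "{tag=" + tag + "}";
                double value = Double.parseDouble(cols[i].trim());
                // Reuse the bucket for this key if it exists, else create it.
                series.computeIfAbsent(key, k -> new ArrayList<>())
                      .add(new Point(timestamp, value));
            }
        }
        return series;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
            "timestamp,tag,metric1,metric2,metric3,metric4,metric5",
            "1315000846,Test_01,62.5,82.5,52.5,10.5,85.5",
            "1315000850,Test_02,52.5,72.5,42.5,5.5,75.5");
        group(csv).forEach((key, points) ->
            System.out.println(key + ": " + points.size() + " point(s)"));
    }
}
```

In real code each bucket would presumably be a WritableDataPoints on which setSeries() is called once, when the bucket is created, so that a single instance never has to hold more than one metric.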

Related

How to display a set of colours that is similar to a specific colour?

I am trying to get a display of the actual built-in colours in R, rather than their names.
If I use colours(), I get a display of all names. Now I would, for example, like to see 10 colours close to "dodgerblue" together with their names in the console.
Is there a way to do this?
You can try the following solution:
Install the package rcolorutils.
You can then find out which colours are close to yours, e.g.:
nearRcolor("dodgerblue", "rgb", dist = 75) #dist - depth of search
An output:
0.0 19.8 48.4 49.4 53.0
"dodgerblue" "dodgerblue2" "deepskyblue2" "royalblue1" "royalblue2"
55.8 57.6 59.2 60.4 70.1
"deepskyblue" "dodgerblue3" "deepskyblue3" "royalblue" "steelblue2"
70.1 72.4 75.4 78.8 79.8
"steelblue3" "cornflowerblue" "royalblue3"
Let's look at these colours:
nearRcolor("dodgerblue", "rgb", dist = 75) %>%
plotCol(nrow = 2)
For anyone else who is looking: I just found a second way to access a first overview:
demo(colors)
Displays all names in the respective colour, and also gives the plots displayed above for defined colours.

Missing data warning R

I have a data frame of climatic values (temperature_max, temperature_min, ...) for different locations. The data form a time series, and on some specific days nothing was recorded. I would like to impute the missing values taking into account both the date and the location (the PLACE variable in the data frame).
I have tried to impute those missing values with Amelia, but no imputation is done and a warning is printed.
Checking variables:
head(df): PLACE, DATE, TEMP_MAX, TEMP_MIN, TEMP_AVG
PLACE DATE TEMP_MAX TEMP_MIN TEMP_AVG
F 12/01/2007 19.7 2.5 10.1
F 13/01/2007 18.8 3.5 10.4
F 14/01/2007 17.3 2.4 10.4
F 15/01/2007 19.5 4.0 9.2
F 16/01/2007
F 17/01/2007 21.5 2.8 9.7
F 18/01/2007 17.7 3.3 12.9
F 19/01/2007 18.3 3.8 9.7
A 16/01/2007 17.7 3.4 9.7
A 17/01/2007
A 18/01/2007 19.7 6.2 10.4
A 19/01/2007 17.7 3.8 10.1
A 20/01/2007 18.6 3.8 12.9
This is just some of the records of my data set.
DF = amelia(df, m=4, ts= c("DATE"), cs = c("PLACE"))
where DATE is the time variable (01/01/2001, 02/01/2001, 03/01/2001, ...). If you filter by PLACE, however, the series do not cover the same period (they do not share the same start and end dates).
I have 3 questions:
1) I am not sure whether the time series should be complete for all places, i.e. have the same start and end dates for every place.
2) I am not using the lags or polytime parameters, so am I imputing correctly, taking the time-series structure into account? I am not sure how to use the lags parameter, although I have checked the R package documentation.
3) When I run that code I get the warning below and no imputation is done.
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!
The software does not care whether the places have different start and end dates; that is more a question of how you think about the data. I would ask myself whether the dates outside a place's range really are missing data (missing at random), and decide on that basis whether to create empty rows for them in the data set.
You would use lags so that past values of a variable help predict its missing values. It is not mandatory (i.e., the function can impute missing data without such a specification), but it can be useful.
I contacted the author of the package and he told me that you need to specify the splinetime or polytime arguments to make sure that Amelia uses the time-series information to impute. For instance, if you set polytime = 3, it will impute based on a cubic polynomial of time. If you do that, I think you shouldn't see that warning anymore.

Retrieve data that have similar values in one column

I have the following dataset:
Class Value
Drive 9.5
Analyser 6.35
GameGUI 12.09
Drive 9.5
Analyser 5.5
GameGUI 2.69
Drive 9.5
Analyser 9.10
GameGUI 6.1
I want to retrieve the classes whose values are all the same, which in the example above is Drive. To do that I have the following command:
dataset[as.logical(ave(dataset$Value, dataset$Class, FUN = function(x) all(x==1))), ]
But this command returns only the classes whose values are always 1. That is not what I want: I don't want to specify a particular value.

Import data from a subset of subjects in R

I am working with a data set with a combined 300 million rows, split over 5 csv files. The data contains weight measurements of users over 5 years (one file per year). As calculations take ages on this massive data set, I would like to work with a subset of users while I develop the code. I've used the nrows argument to import only the first 50000 lines of each file. However, one user may have 400 weight measurements in the file for year 2014 but only 240 in year 2015, so I don't get the same set of users from each file when I import with nrows. Is there a way to import the data of the first 1000 users in each file?
The data looks like this in all files:
user_ID date_local weight_kg
0002a3e897bd47a575a720b84aad6e01632d2069 2016-01-07 99.2
0002a3e897bd47a575a720b84aad6e01632d2069 2016-02-08 99.6
0002a3e897bd47a575a720b84aad6e01632d2069 2016-02-10 99.5
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-03-13 99.1
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-04-20 78.2
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-05-02 78.3
000115ff92b4f18452df4a1e5806d4dd771de64c 2016-05-07 78.9
0002b526e65ecdd01f3a373988e63a44d034c5d4 2016-08-15 82.1
0002b526e65ecdd01f3a373988e63a44d034c5d4 2016-08-22 82.6
Thanks a lot in advance!
If you have grep on your system you can combine it with pipe and read.table (or read.csv, as below) to read only the rows that match a pattern. For example, you could read only the users whose ID starts with 001 or 002 like this. You'll need to add the headers back later, as the header line won't match the pattern.
mydata <- read.csv(pipe('grep "^00[12]" "mydata.csv"'),
colClasses = c("character", "Date", "numeric"),
header = FALSE)
I'm not sure what the pattern is for your user_ID: you give 001 as an example but state that you want the first 1000. If that is 0001 - 1000, a pattern for grep might be something like ^[01][0-9]{3} (used with grep -E, since the {3} quantifier is extended-regex syntax).
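Since the IDs in the sample look like hashes rather than sequential numbers, a prefix pattern may not select "the first 1000 users". Another option is to collect the first N distinct IDs from one file and then keep only those users' rows in every file. The selection logic does not depend on R; below is a minimal sketch in Java, assuming whitespace-separated rows with the ID in the first field and the header line already removed. UserSubset, firstUserIds and rowsFor are hypothetical names, and for files of this size the rows would be streamed rather than held in a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: picks the same subset of users from every yearly file.
final class UserSubset {

    // Returns the user ID, i.e. the first whitespace-separated field of a row.
    static String userId(String row) {
        return row.trim().split("\\s+", 2)[0];
    }

    // Collects the first maxUsers distinct IDs in order of appearance.
    static Set<String> firstUserIds(List<String> rows, int maxUsers) {
        Set<String> ids = new LinkedHashSet<>();
        for (String row : rows) {
            if (ids.size() >= maxUsers) {
                break; // enough users collected
            }
            if (!row.trim().isEmpty()) {
                ids.add(userId(row));
            }
        }
        return ids;
    }

    // Keeps only the rows whose user ID is in ids, preserving row order.
    static List<String> rowsFor(List<String> rows, Set<String> ids) {
        List<String> kept = new ArrayList<>();
        for (String row : rows) {
            if (!row.trim().isEmpty() && ids.contains(userId(row))) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up rows standing in for two yearly files.
        List<String> year1 = Arrays.asList(
            "u1 2014-01-07 99.2", "u1 2014-02-08 99.6", "u2 2014-03-13 99.1");
        List<String> year2 = Arrays.asList(
            "u2 2015-04-20 78.2", "u3 2015-05-02 82.1", "u1 2015-05-07 98.9");

        Set<String> ids = firstUserIds(year1, 2); // chosen once, from the first file
        rowsFor(year1, ids).forEach(System.out::println);
        rowsFor(year2, ids).forEach(System.out::println);
    }
}
```

Choosing the IDs once and reusing the same set for all five files is what guarantees that every file contributes the same users, however many measurements each user has per year.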

Memory problems with large-scale social network visualization using R and Cytoscape

I'm relatively new to R and am trying to solve the following problem:
I work on a Windows 7 Enterprise platform with the 32bit version of R
and have about 3GB of RAM on my machine. I have large-scale social
network data (c. 7,000 vertices and c. 30,000 edges) which are
currently stored in my SQL database. I have managed to pull this data
(omitting vertex and edge attributes) into an R dataframe and then
into an igraph object. For further analysis and visualization, I would
now like to push this igraph into Cytoscape using RCytoscape.
Currently, my approach is to convert the igraph object into a
graphNEL object since RCytoscape seems to work well with this object
type. (The igraph plotting functions are much too slow and lack
further analysis functionality.)
Unfortunately, I always run into memory issues when running this
script. It has worked previously with smaller networks though.
Does anyone have an idea on how to solve this issue? Or can you
recommend any other visualization and analysis tools that work well
with R and can handle such large-scale data?
Sorry for taking several days to get back to you.
I just ran some tests in which
1) an adjacency matrix is created in R
2) an R graphNEL is then created from the matrix
3) (optionally) node & edge attributes are added
4) a CytoscapeWindow is created, displayed, laid out, and redrawn
(all times are in seconds)
nodes  edges  attributes?  matrix   graph      cw  display  layout  redraw   total
   70     35  no            0.001   0.001     0.5      5.7     2.5   0.016     9.4
   70      0  no            0.033   0.001     0.2      4.2     0.5    0.49     5.6
  700    350  no            0.198   0.036     6.0      8.3     1.6   0.037    16.7
 1000    500  no             0.64    0.07    12.0      9.8     1.8    0.09    24.9
 1000    500  yes            0.42   30.99    15.7     29.9     1.7    0.08    79.4
 2000   1000  no              3.5    0.30    73.5     14.9     4.8    0.08    96.6
 2500   1250  no              2.7    0.45   127.1     18.3    11.5    0.09   160.7
 3000   1500  no              4.2    0.46   236.8     19.6    10.7    0.10   272.8
 4000   2000  no              8.4    0.98   502.2     27.9    21.4    0.14   561.8
To my complete surprise, and chagrin, there is an exponential slowdown in 'cw' (the new.CytoscapeWindow method) --which makes no sense at all. It may be that your memory exhaustion is related to that, and is quite fixable.
I will explore this, and probably have a fix in the next week.
By the way, did you know that you can create a graphNEL directly from an adjacency matrix?
g = new ("graphAM", adjMat = matrix, edgemode="directed")
Thanks, Ignacio, for your most helpful report. I should have done these timing tests long ago!
Paul
It has been a while since I used Cytoscape so I am not exactly sure how to do it, but the manual states that you can use text files as input using the "Table Import" feature.
In igraph you can use the write.graph() function to export a graph in a bunch of ways. This way you can circumvent having to convert to a graphNEL object which might be enough to not run out of memory.
