Pandas MultiIndex forces datetime.date objects to Timestamp objects

I recently updated Pandas and found this strange behaviour, which broke some of my existing code.
I was using a column of datetime.date objects as the second level in a two-level MultiIndex.
However, when setting the index with the latest version, the datetime.date objects are converted to Timestamp objects with 00:00:00 as the time component:
>>> pd.__version__
'0.15.1'
>>> df
          0  ID        date
0  0.486567  10  2014-11-12
1  0.214374  20  2014-11-13
>>> df.date[0]
datetime.date(2014, 11, 12)
>>> df.set_index(['ID', 'date']).index[0]
(10, Timestamp('2014-11-12 00:00:00'))
This doesn't happen with version 0.14 or older, nor does it happen when a single column of dates is set as the index; it only happens for MultiIndexes.
There is a hack to get around it: setting the dates as a single-level index, appending the other level, and then swapping:
>>> df.set_index('date').set_index('ID', append=True).index.swaplevel(0, 1)[0]
(10, datetime.date(2014, 11, 12))
This seems strange, and I wondered whether it was intentional and whether there is a proper way to use datetime.date objects in the new version.

see here
There was an inconsistency in how date-likes (datetime.date, datetime.datetime, Timestamp) were inferred in a MultiIndex level. This led to the creation of an object-dtyped Index rather than a DatetimeIndex. datetime.date objects are second-class citizens in pandas, as they are not efficiently represented.
If you really really want to create this, you can do this:
In [8]: pd.MultiIndex.from_arrays([Index([datetime.date(2013,1,1)]), ['a']])
Out[8]:
MultiIndex(levels=[[2013-01-01], [u'a']],
           labels=[[0], [0]])
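Building on that, here is a minimal sketch (assuming the 0.15-era behaviour discussed in this thread) of applying the same idea to the question's DataFrame: the MultiIndex is constructed from explicit Index objects so the date level keeps its datetime.date values.
import datetime
import pandas as pd

# DataFrame shaped like the one in the question
df = pd.DataFrame({0: [0.486567, 0.214374],
                   'ID': [10, 20],
                   'date': [datetime.date(2014, 11, 12),
                            datetime.date(2014, 11, 13)]})

# Build the MultiIndex from explicit Index objects so the 'date' level
# stays object-dtyped instead of being inferred as a DatetimeIndex.
idx = pd.MultiIndex.from_arrays([pd.Index(df['ID']),
                                 pd.Index(df['date'], dtype=object)],
                                names=['ID', 'date'])
df2 = df.drop(['ID', 'date'], axis=1).set_index(idx)
print(df2.index[0])  # (10, datetime.date(2014, 11, 12))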

We came across the same issue, and it is still a problem in 0.16. We consider it a bug, as it is inconsistent with the behaviour of creating a single index and only occurs with a MultiIndex. Why silently change the type if we choose to have it as datetime.date? set_index should just set the index without changing things.
We don't need the time component. If we wanted to speed things up and handle it more efficiently by using a Timestamp, we should be able to choose that.
It breaks all the code where the index is converted back and forth between columns and index as the table is manipulated (pivoting etc.), because of the silent type conversion. It also breaks interaction with other applications and code we have no control over.
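As a minimal sketch of the round trip referred to above (column to MultiIndex level and back), assuming the 0.15 behaviour described in this thread:
# Round trip: column -> MultiIndex level -> column
df2 = df.set_index(['ID', 'date']).reset_index()
print(type(df.date[0]))   # datetime.date going in
print(type(df2.date[0]))  # Timestamp coming back out (the silent conversion)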

How to get a list of Auto-IVC component output names

I'm switching over to using the Auto-IVC component as opposed to the IndepVar component. I'd like to be able to get a list of the promoted output names of the Auto-IVC component, so I can then use them to go and pull the appropriate value out of a configuration file and set the values that way. This will get rid of some boilerplate.
p.model._auto_ivc.list_outputs()
returns an empty list. It seems that p.model.__dict__ has this information encoded in it, but I don't know exactly what is going on there, so I am wondering if there is an easier way to do it.
To avoid confusion from future readers, I assume you meant that you wanted the promoted input names for the variables connected to the auto_ivc outputs.
We don't have a built-in function to do this, but you could do it with a bit of code like this:
seen = set()
for n in p.model._inputs:
    src = p.model.get_source(n)
    if src.startswith('_auto_ivc.') and src not in seen:
        print(src, p.model._var_allprocs_abs2prom['input'][n])
        seen.add(src)
assuming 'p' is the name of your Problem instance.
The code above just prints each auto_ivc output name followed by the promoted input it's connected to.
Here's an example of the output when run on one of our simple test cases:
_auto_ivc.v0 par.x
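If it is more useful to collect the mapping than to print it, here is a hedged variant of the same snippet; the function name auto_ivc_to_promoted is just illustrative, and it relies on the same internal attributes as the code above, so it may need adjusting for other OpenMDAO versions.
def auto_ivc_to_promoted(p):
    # Map each _auto_ivc output to the promoted name of an input it drives.
    mapping = {}
    for n in p.model._inputs:
        src = p.model.get_source(n)
        if src.startswith('_auto_ivc.') and src not in mapping:
            mapping[src] = p.model._var_allprocs_abs2prom['input'][n]
    return mapping

# e.g. auto_ivc_to_promoted(p) -> {'_auto_ivc.v0': 'par.x'} for the test case above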

How to combine many record values into one record

As you can see from the picture below, I was able to combine two deals (outlined in red), but the output should have one result instead of two. If anyone has any solutions for this, please advise.
The component outlined in red has more than one record; each record has an amount, and the sum of all the record amounts must be shown in a single row.
record1: Amount:100
record2: Amount:200
record3: Amount:500
Merging all the records gives the following:
record: Amount:800
Is it possible to merge many rows into a single row in Integromat?
Based on your screenshot, you are aggregating the wrong module. The source module in your aggregator has to be set to a module that generates multiple bundles; in your case, that is module 10.
You are aggregating module 14, which generates a single output bundle for every input bundle, so there is nothing to aggregate. Module 10 returns 2 bundles for a single input.
Your case:
/---[6]---([14]---[11 aggregator])---
---[10] multiple output bundles
\---[6]---([14]---[11 aggregator])---
Solution:
/---[6]---[14]---\
---([10] [11 aggregator])--- single output bundle
\---[6]---[14]---/
Your scenario has to look like this (Aggregator: Source module = module no.10):

DateTimeParseException while trying to perform ZonedDateTime.parse

Using Java 8u222, I've been trying a silly operation, and it runs into an error that I'm not able to fully understand. The line of code:
ZonedDateTime.parse("2011-07-03T02:20:46+06:00[Asia/Qostanay]");
The error:
java.time.format.DateTimeParseException: Text '2011-07-03T02:20:46+06:00[Asia/Qostanay]' could not be parsed, unparsed text found at index 25
at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1952)
at java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1851)
at java.time.ZonedDateTime.parse(ZonedDateTime.java:597)
at java.time.ZonedDateTime.parse(ZonedDateTime.java:582)
Using the same date (the timezone may be incorrect; the intention here is just testing), I changed the value inside the square brackets and it works:
ZonedDateTime.parse("2011-07-03T02:20:46+06:00[Europe/Busingen]");
It works as expected, as do other values such as:
ZonedDateTime.parse("2011-07-03T02:20:46+06:00[Asia/Ulan_Bator]")
ZonedDateTime.parse("2011-07-03T02:20:46+06:00[SystemV/CST6CDT]")
I found some similar questions, such as the one below, but not precisely the same usage that I'm trying / facing:
Error java.time.format.DateTimeParseException: could not be parsed, unparsed text found at index 10
Does someone with an understanding of the Java Date API know what I'm doing wrong here?
Thanks.
Asia/Qostanay is a zone which doesn't exist in JDK 8's list of timezones; it was added later.
If you don't care about the location part of the timezone, then just strip the [...] part off the end of the string before parsing. Knowing that the offset is +06:00 is going to be sufficient for almost all purposes.
Alternatively, upgrade to a more recent version of Java.

Configuring scollector to get different frequencies for different collectors

I'm working on scollector and I want to have specific frequencies for different collectors.
For example:
get info from disk usage every 5 minutes
info from memory every minute
iostat every 30 seconds
and so on...
Here is a part of the conf.toml I made:
FullHost = true
Freq = 60
DisableSelf = true
[[iostat]]
Filter = "iostat"
Freq = 30
[[memory]]
Filter = "memory"
Freq = 60
But I get an error:
./scollector -conf="perso.toml" -p
2016/04/19 14:40:45 fatal: main.go:297: extra keys in perso.toml: [iostat iostat.Freq memory memory.Freq]
It seems that I cannot set multiple frequencies.
What should I do to get what I want?
Thank you all
According to the scollector documentation, Freq is a global setting, so it's not possible to set different frequencies for each collector. The exception is external collectors, which may be put in a folder named after the desired frequency (in seconds).
Freq is indeed a global setting, and the collection interval is usually set to it, although some collectors override the interval with different values; e.g. elasticsearch-indices runs every 15 minutes because there's a lot of data to pull.
To change it, either:
(best) hack the scollector code to read and pass a freq parameter to every collector
(second best) file a GitHub issue
(last resort) just change the intervals in the code of specific collectors and recompile scollector
Well, we might have found something.
We created different folders representing several frequencies (0, 30, 60, 120...), and in each folder we put the external collectors we need:
'/etc/collectors/0',
'/etc/collectors/15',
'/etc/collectors/30',
'/etc/collectors/60',
'/etc/collectors/120',
'/etc/collectors/300',
'/etc/collectors/600'
In the conf.toml:
ColDir = "/etc/scollector/collectors"
If we want the internal collectors, we have to rewrite them :(
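For reference, an external collector is just an executable dropped into one of those frequency-named folders that prints metric lines to stdout. Here is a minimal sketch in Python, assuming the tcollector/OpenTSDB-style "metric timestamp value tags" output format that scollector reads; the metric name and tag are made up for illustration.
#!/usr/bin/env python
# Minimal external collector sketch: print one metric line per run.
# Placed in e.g. /etc/scollector/collectors/30, it would be run every 30 seconds.
import time

def main():
    now = int(time.time())
    value = 42  # hypothetical reading, purely for illustration
    print("example.custom.metric %d %d host=myhost" % (now, value))

if __name__ == "__main__":
    main()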

Titan Graph Queries taking too long to execute

I have a problem with the execution speed of Titan queries.
To be more specific:
I created a properties file for my graph using BerkeleyJE, which looks like this:
storage.backend=berkeleyje
storage.directory=/finalGraph_script/graph
Afterwards, I used gremlin.bat to open my graph.
I set up all the necessary index keys for my nodes:
m = g.getManagementSystem();
username = m.makePropertyKey('username').dataType(String.class).make()
m.buildIndex('byUsername',Vertex.class).addKey(username).unique().buildCompositeIndex()
m.commit()
g.commit()
(all other keys are created the same way...)
I imported a CSV file containing about 100,000 lines; each line produces at least 2 nodes and some edges. All this is done via batch loading.
That works without a problem.
Then I execute a groupBy query that looks like this:
m = g.V.has("imageLink").groupBy{it.imageLink}{it.in("is_on_image").out("is_species")}{it._().species.groupCount().cap.next()}.cap.next()
With this query I want, for every node with the property key "imageLink", the count of each distinct "species". "Species" are also nodes, and can be reached by going back along the edge "is_on_image" and then following the edge "is_species".
Well, this also works like a charm for my current nodes. The query takes about 2 minutes on my local PC.
But now to the problem.
My whole dataset is a CSV with 10 million entries. The structure is the same as above, and each line also creates at least 2 nodes and some edges.
On my local PC I can't even import this set; it causes a memory exception after 3 days of loading.
So I tried the same on a server with much more RAM and memory. There the import works and takes about 1 day, but the groupBy fails after about 3 days.
I actually don't know if the groupBy itself fails, or just the connection to the server after that long a time.
So my first question:
In my opinion, about 15 million nodes shouldn't be that big a deal for a graph database, should it?
Second question:
Is it normal that it takes so long? Or is there any way to speed it up using indices? I configured the indices as listed above :(
I don't know exactly which information you need to help me, so please just tell me what you need in addition to that.
Thanks a lot!
Best regards,
Ricardo
EDIT 1: The way I'm loading the CSV into the graph:
I'm using this code; I deleted some unnecessary properties, which are also set as properties on some nodes and loaded the same way.
bg = new BatchGraph(g, VertexIDType.STRING, 10000)
new File("annotation_nodes_wNothing.csv").eachLine({ final String line ->def (annotationId,species,username,imageLink) = line.split('\t')*.trim();def userVertex = bg.getVertex(username) ?: bg.addVertex(username);def imageVertex = bg.getVertex(imageLink) ?: bg.addVertex(imageLink);def speciesVertex = bg.getVertex(species) ?: bg.addVertex(species);def annotationVertex = bg.getVertex(annotationId) ?: bg.addVertex(annotationId);userVertex.setProperty("username",username);imageVertex.setProperty("imageLink", imageLink);speciesVertex.setProperty("species",species);annotationVertex.setProperty("annotationId", annotationId);def classifies = bg.addEdge(null, userVertex, annotationVertex, "classifies");def is_on_image = bg.addEdge(null, annotationVertex, imageVertex, "is_on_image");def is_species = bg.addEdge(null, annotationVertex, speciesVertex, "is_species");})
bg.commit()
g.commit()
