Oracle has a table function called SDO_JOIN that is used to join tables based on spatial relations. An example query to find which neighbourhood a house is in looks something like this:
select
house.address,
neighbourhood.name
from table(sdo_join('HOUSE', 'GEOMETRY', 'NEIGHBOURHOOD', 'GEOMETRY', 'mask=INSIDE')) a
inner join house
on a.rowid1 = house.rowid
inner join neighbourhood
on a.rowid2 = neighbourhood.rowid;
But I get the same result by just doing a regular join with a spatial relation in the on clause:
select
house.address,
neighbourhood.name
from house
inner join neighbourhood
on sdo_inside(house.geometry, neighbourhood.geometry) = 'TRUE';
I prefer the second method because I think it's easier to understand what exactly is happening, but I wasn't able to find any Oracle documentation on whether or not this is the proper way to do a spatial join.
Is there any difference between these two methods? If there is, what? If there isn't, which style is more common?
The difference is in performance.
The first approach (SDO_JOIN) isolates the candidates by matching the RTREE indexes on each table.
The second approach will search the HOUSE table for each geometry of the NEIGHBOURHOOD table.
So much depends on how large your tables are, and in particular how large the NEIGHBOURHOOD table is, or more precisely, how many rows of the NEIGHBOURHOOD table your query actually uses. If the NEIGHBOURHOOD table is small (fewer than 1000 rows), then the second approach is good (and the size of the HOUSE table does not matter).
On the other hand, if you need to match millions of houses and millions of neighborhoods, then the SDO_JOIN approach will be more efficient.
Note that the SDO_INSIDE approach can be efficient too: just make sure you enable SPATIAL_VECTOR_ACCELERATION (only if you use Oracle 12.1 or 12.2 and you have the proper license for Oracle Spatial and Graph) and use parallelism.
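To make the difference between the two strategies more concrete, here is a minimal Python sketch. It is not how Oracle implements either operator: a uniform grid stands in for the R-tree index, houses are points, neighbourhoods are axis-aligned rectangles, and every name in it (CELL, probe_join, index_join, the sample data) is made up for illustration. probe_join mimics the second approach (probe the HOUSE index once per neighbourhood), while index_join mimics the SDO_JOIN idea (index both sides and match them against each other).

```python
from collections import defaultdict

CELL = 10.0  # grid cell size; an arbitrary value chosen for this sketch


def cell_of(x, y):
    """Grid cell that contains the point (x, y)."""
    return (int(x // CELL), int(y // CELL))


def cells_of_box(box):
    """All grid cells overlapped by a rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    cx0, cy0 = cell_of(xmin, ymin)
    cx1, cy1 = cell_of(xmax, ymax)
    return [(cx, cy) for cx in range(cx0, cx1 + 1) for cy in range(cy0, cy1 + 1)]


def inside(point, box):
    """Exact test: is the point strictly inside the rectangle?"""
    x, y = point
    xmin, ymin, xmax, ymax = box
    return xmin < x < xmax and ymin < y < ymax


def build_point_index(points):
    """Toy spatial index: grid cell -> ids of the points in that cell."""
    index = defaultdict(list)
    for pid, p in points.items():
        index[cell_of(*p)].append(pid)
    return index


def probe_join(houses, neighbourhoods):
    """Second approach: probe the house index once per neighbourhood."""
    index = build_point_index(houses)
    result = set()
    for nid, box in neighbourhoods.items():
        for c in cells_of_box(box):
            for hid in index.get(c, []):
                if inside(houses[hid], box):
                    result.add((hid, nid))
    return result


def index_join(houses, neighbourhoods):
    """First approach: index both sides, then match them cell by cell."""
    house_index = build_point_index(houses)
    box_index = defaultdict(list)
    for nid, box in neighbourhoods.items():
        for c in cells_of_box(box):
            box_index[c].append(nid)
    result = set()
    for c in house_index.keys() & box_index.keys():
        for hid in house_index[c]:
            for nid in box_index[c]:
                if inside(houses[hid], neighbourhoods[nid]):
                    result.add((hid, nid))
    return result


# Made-up sample data
houses = {"h1": (5.0, 5.0), "h2": (25.0, 12.0), "h3": (95.0, 95.0)}
neighbourhoods = {"north": (0.0, 0.0, 20.0, 20.0), "east": (20.0, 0.0, 40.0, 20.0)}
pairs_probe = probe_join(houses, neighbourhoods)
pairs_index = index_join(houses, neighbourhoods)
```

Both functions return the same pairs; what changes is how the candidates are found. The probe version does one index lookup per neighbourhood, so its cost grows with the number of neighbourhoods, which matches the advice above about small versus very large NEIGHBOURHOOD tables.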
I am an absolute beginner in PostgreSQL and PostGIS (and databases in general) but have fairly good working experience in R. I have two multi-polygon data sets of vulnerable areas of India from two different sources: one is around 12 GB and in .gdb format (let's call it mygdb) and the other is a shapefile of around 2 GB (let's call it myshp). I want to compare the two sets of vulnerability maps and generate some state-wise measures of fit using the intersection (I), difference (D), and union (U) between the maps.
I would like to make use of PostGIS functionality (via R), as neither R (crashes!) nor QGIS (too slow) is efficient for this. To start with, I have uploaded both data sets to my PostGIS database. I used ogr2ogr in R to upload mygdb. But I am kind of stuck at this point. My idea is to split both polygon files by state and then apply other functions to get I, U and D. From my search, I think I can use sf functions like st_split, st_intersect, st_difference, and st_union. However, even after splitting, I would imagine that the file sizes will still be too large for R to process, so my questions are:
Is my approach the best way forward?
How can I use sf::st_ functions (e.g. st_split, st_intersection) without importing the data from the database into R?
There are some useful answers to previous relevant questions, like this one for example. But I find it hard to put the steps together from different links and any help with a dummy example would be great. Many thanks in advance.
Maybe you could try loading it as a stars proxy object. A proxy doesn't load the file into memory; it works on the file on disk and only reads data when it is needed.
https://r-spatial.github.io/stars/articles/stars2.html
Not an answer to the question sensu stricto, but in response to the request in the comments, here is an example of a PostgreSQL/PostGIS query using ST_Intersection. It is based on OSM data imported into a PostgreSQL database with osm2pgsql:
WITH
highway AS (
select osm_id, way from planet_osm_line where osm_id = 332054927),
dln AS (
select osm_id, way from planet_osm_polygon where "boundary" = 'administrative'
and "admin_level" = '4' and "ref" = 'DS')
SELECT ST_Intersection(dln.way, highway.way) FROM highway, dln
I'm generating a table which will in turn be used to format several different statistics and graphs.
Some columns of this table are the result of subqueries that share a nearly identical structure. My query works, but it is very inefficient, even in a simplified example like the following one.
SELECT
o.order,
o.date,
c.clienttype,
o.producttype,
(SELECT date FROM orders_interactions LEFT JOIN categories WHERE order=o.order AND category=3) as completiondate,
(SELECT amount FROM orders_interactions LEFT JOIN categories WHERE order=o.order AND category=3) as amount,
DATEDIFF((select date from orders_interactions LEFT JOIN categories where order=o.order AND category=3),o.date) as elapseddays
FROM orders o
LEFT JOIN clients c ON c.idClient=o.idClient
Since this is a simplified example of a much more complex query, I would like to know the recommended approaches for a query like this one, taking into account query times and readability.
As the example shows, I had to repeat a subquery (the one with date), just to calculate a datediff, since I cannot directly reference the column 'completiondate'
Thank you
You can try a left join.
SELECT o.order,
o.date,
o.producttype,
oi.date completiondate,
oi.amount,
datediff(oi.date, o.date) elapseddays
FROM orders o
LEFT JOIN orders_interactions oi
ON oi.order = o.order
AND oi.category = 3;
That doesn't necessarily perform better, but there is a good chance it will. For performance, an index on orders_interactions (order, category) might help in any case.
Whether you consider it more readable is up to you. But at least it's less repetitive (which doesn't necessarily translate to better performance: just because an expression is repeated in a query doesn't necessarily mean it is calculated repeatedly).
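If you want to convince yourself that the join form returns the same rows as the repeated subqueries, here is a small self-contained check using Python's built-in sqlite3 module. This is only a sketch under assumptions: the schema and sample rows are made up, the columns are renamed (order_id, order_date, event_date) to stay clear of reserved words, and since SQLite has no DATEDIFF, a difference of julianday() values stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
    CREATE TABLE orders_interactions (
        order_id INTEGER, category INTEGER, event_date TEXT, amount REAL);
    -- Composite index along the lines suggested above
    CREATE INDEX idx_oi_order_category
        ON orders_interactions (order_id, category);
    INSERT INTO orders VALUES (1, '2020-01-01'), (2, '2020-01-05');
    INSERT INTO orders_interactions VALUES
        (1, 3, '2020-01-11', 99.5),
        (1, 1, '2020-01-02', 0.0);
""")

# Original style: the same correlated subquery repeated three times
subquery_sql = """
    SELECT o.order_id,
           (SELECT oi.event_date FROM orders_interactions oi
             WHERE oi.order_id = o.order_id AND oi.category = 3) AS completiondate,
           (SELECT oi.amount FROM orders_interactions oi
             WHERE oi.order_id = o.order_id AND oi.category = 3) AS amount,
           julianday((SELECT oi.event_date FROM orders_interactions oi
             WHERE oi.order_id = o.order_id AND oi.category = 3))
             - julianday(o.order_date) AS elapseddays
      FROM orders o
     ORDER BY o.order_id
"""

# Join style: one LEFT JOIN, each column referenced directly
join_sql = """
    SELECT o.order_id,
           oi.event_date AS completiondate,
           oi.amount,
           julianday(oi.event_date) - julianday(o.order_date) AS elapseddays
      FROM orders o
      LEFT JOIN orders_interactions oi
        ON oi.order_id = o.order_id AND oi.category = 3
     ORDER BY o.order_id
"""

rows_subquery = conn.execute(subquery_sql).fetchall()
rows_join = conn.execute(join_sql).fetchall()
print(rows_subquery == rows_join)
```

Note that the two forms are only equivalent when each order has at most one interaction with category 3. With several matches the join returns one row per match, whereas a scalar subquery either fails or picks a single row, depending on the database.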
It seems I might have found the answer.
In my opinion, it improves readability quite a bit, and in my real usage scenario, both profile and execution plans are way more efficient, and results are returned in less than 1/3 of the time.
My answer relies on using a SELECT inside the LEFT JOIN, that is, using a subquery as the JOIN's 'input'.
SELECT
o.order,
o.date,
c.clienttype,
o.producttype,
tmp.date,
tmp.amount,
DATEDIFF(tmp.date,o.date) as elapseddays
FROM orders o
LEFT JOIN clients c ON c.idClient=o.idClient
LEFT JOIN (SELECT order,date,amount FROM orders_interactions oi LEFT JOIN categories ct ON ct.order=oi.order AND category=3) AS tmp ON tmp.order=o.order
The answer idea, and the explanation about how and why it works, came from this post: Mysql Reference subquery result in parent where clause
I am using Google Dataflow + Scio to do a cross join of a dataset with itself to find the top-K most similar records using cosine similarity. The dataset has around 200k records and its total size is ~300MB.
I am joining the dataset with itself by passing it as a side input by setting the workerCacheMB to 500MB.
The dataset is a tuple and it looks like this: (String,Set(Integer)). The first element in the tuple is the URI and the next element is a set of entity indexes.
Most records in the dataset have under 500 entity indexes. However, there are about 7000 records which have over 10k entities and the maximum one has 171k entities.
I have some hot keys and hence the worker utilization looks like this:
By the time it had scaled up to 80 nodes and then scaled back down to 1 node, it had already processed about 90% of the records. I assume the hot keys got stuck on that last node, and it took the rest of the time to process all of them.
I tried the --experiments=shuffle_mode=service option. Though it gave an improvement, the problem persists. I was thinking about ways to use the sharded HotKey join mentioned here; however, since I need to find similarity, I don't think I can afford to split the hot entities and rejoin them.
I was wondering if there is a way to solve it or if I basically have to live with this.
Obviously, this is a crude way to find sims. However, I am interested in finding solution to the Data engineering part of the problem, while letting ML engineers iterate on the Sim finding algorithms.
The stripped down version of the code looks as follows:
private def findSimilarities(e1: Set[Integer], e2: Set[Integer]): Float = {
val common = e1.intersect(e2)
val cosine = (common.size.toFloat) / (e1.size + e2.size).toFloat
cosine
}
val topN = sortedReverseTake[ElementSims](250)(by(_.getScore))
elements
.withSideInputs(elementsSI)
.flatMap { case (e1, si) =>
val fromUri = e1._1.toString
val fromEntities = e1._2
val sideInput: List[(String, Set[Integer])] = si(elementsSI)
val sims: List[ElementSims] = findSimilarities(fromUri,fromEntities,
sideInput)
topN(sims)
}
.toSCollection
.saveAsAvroFile(outputGCS)
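For readers who want to experiment with the per-record computation outside Dataflow, here is a minimal single-machine Python sketch of the scoring and top-N step. It does nothing about the hot-key skew itself. The function names (overlap_score, set_cosine, top_n) and the sample data are hypothetical. overlap_score reproduces the formula from the snippet above; set_cosine is the usual cosine similarity of two sets viewed as binary vectors, included only for comparison because the two formulas differ.

```python
import heapq
import math


def overlap_score(e1, e2):
    """Score from the snippet above: |e1 & e2| / (|e1| + |e2|)."""
    total = len(e1) + len(e2)
    return len(e1 & e2) / total if total else 0.0


def set_cosine(e1, e2):
    """Cosine similarity of two sets treated as binary vectors."""
    if not e1 or not e2:
        return 0.0
    return len(e1 & e2) / math.sqrt(len(e1) * len(e2))


def top_n(from_entities, candidates, n, score=overlap_score):
    """Keep only the n best (score, uri) pairs instead of sorting them all."""
    scored = ((score(from_entities, ents), uri) for uri, ents in candidates)
    return heapq.nlargest(n, scored)


# Made-up sample data: (uri, set of entity indexes)
elements = [("a", {1, 2, 3}), ("b", {2, 3, 4}), ("c", {9})]
best = top_n({1, 2, 3}, elements, 2)
print(best)
```

heapq.nlargest keeps at most n candidates in memory while it scans, which plays the same role as the sortedReverseTake(250) aggregator in the Scala code.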
I have some tables in Hive that I need to join together. Since I need to do some work on each of them (normalize the key, remove outliers, ...) and as I add more and more tables, this chaining process has turned into a big mess.
It is so easy to get lost where you are and the query is getting out of control.
However, I have a pretty clear idea how the final table should look like and each column is fairly independent of the other tables.
For example:
table_class1
name id score
Alex 1 90
Chad 3 50
...
table_class2
name id score
Alexandar 1 50
Benjamin 2 100
...
In the end I really want something that looks like:
name  id  class1  class2  ...
alex  1   90      50
ben   2   NA      100
chad  3   50      NA
I know it could be done with left outer joins, but I am really having a hard time creating a separate table for each of them after the normalization and then left outer joining each of them to the union of the keys...
I am thinking about using NOSQL(HBase) to dump the processed data into NOSQL format.. like:
(source, key, variable, value)
(table_class1, (alex, 1), class1, 90)
(table_class1, (chad, 3), class1, 50)
(table_class2, (alex, 1), class2, 50)
(table_class2, (benjamin, 2), class2, 100)
...
In the end, I want to use something like the melt and cast in R reshape package to bring that data back to be a table.
This is a big data project, and there will be hundreds of millions of key-value pairs in HBase.
(1) I don't know if this is a legit approach.
(2) If so, is there any big data tool to pivot a long HBase table into a Hive table?
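To make the intended reshaping concrete, here is a small in-memory Python sketch of the "cast" step: it turns the long (source, key, variable, value) tuples into one wide row per key. It only illustrates the target shape and says nothing about how to do this at HBase scale. The function name cast_wide and the "NA" placeholder are made up for the example.

```python
from collections import defaultdict

# Long format, as in the tuples listed above
long_rows = [
    ("table_class1", ("alex", 1), "class1", 90),
    ("table_class1", ("chad", 3), "class1", 50),
    ("table_class2", ("alex", 1), "class2", 50),
    ("table_class2", ("benjamin", 2), "class2", 100),
]


def cast_wide(rows, variables, missing="NA"):
    """Pivot long rows into one wide row per key, ordered by id."""
    by_key = defaultdict(dict)
    for _source, key, variable, value in rows:
        by_key[key][variable] = value
    return [
        key + tuple(cells.get(v, missing) for v in variables)
        for key, cells in sorted(by_key.items(), key=lambda kv: kv[0][1])
    ]


wide = cast_wide(long_rows, ["class1", "class2"])
for row in wide:
    print(row)
```

In SQL engines, the same reshaping is commonly written as a GROUP BY on the key with one conditional aggregate per variable, which avoids chaining one outer join per source table.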
Honestly, I would love to help more, but I am not clear about what you're trying to achieve (maybe because I've never used R), please elaborate and I'll try to improve my answer if necessary.
What do you need HBase for? You can store your processed data in new tables and work with them; you can even CREATE VIEW to simplify the query if it's too large. Maybe that's what you're looking for (HIVE manual). Unless you have a good reason for using HBase, I'd stick to HIVE to avoid additional complexity. Don't get me wrong, there are a lot of valid reasons for using HBase.
About your second question, you can define and use HBase tables as HIVE tables, you can even CREATE and SELECT INSERT into them all inside HIVE, is that what you're looking for?: HBase/HIVE integration doc
One last thing in case you don't know, you can create custom functions in HIVE very easily to help you with the tedious normalization process, take a look at this.
I am using data.table and there are many functions which require me to set a key (e.g. X[Y]). As such, I wish to understand what a key does in order to properly set keys in my data tables.
One source I read was ?setkey.
setkey() sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference. No copy is made at all, other than temporary working memory as large as one column.
My takeaway here is that a key would "sort" the data.table, resulting in a very similar effect to order(). However, it doesn't explain the purpose of having a key.
The data.table FAQ 3.2 and 3.3 explains:
3.2 I don't have a key on a large table, but grouping is still really quick. Why is that?
data.table uses radix sorting. This is significantly faster than other
sort algorithms. Radix is specifically for integers only, see
?base::sort.list(x,method="radix"). This is also one reason why
setkey() is quick. When no key is set, or we group in a different order
from that of the key, we call it an ad hoc by.
3.3 Why is grouping by columns in the key faster than an ad hoc by?
Because each group is contiguous in RAM, thereby minimising page
fetches, and memory can be copied in bulk (memcpy in C) rather than
looping in C.
From here, I guess that setting a key somehow allows R to use "radix sorting" over other algorithms, and that's why it is faster.
The 10 minute quick start guide also has a guide on keys.
Keys
Let's start by considering data.frame, specifically rownames (or in
English, row names). That is, the multiple names belonging to a single
row. The multiple names belonging to the single row? That is not what
we are used to in a data.frame. We know that each row has at most one
name. A person has at least two names, a first name and a second name.
That is useful to organise a telephone directory, for example, which
is sorted by surname, then first name. However, each row in a
data.frame can only have one name.
A key consists of one or more
columns of rownames, which may be integer, factor, character or some
other class, not simply character. Furthermore, the rows are sorted by
the key. Therefore, a data.table can have at most one key, because it
cannot be sorted in more than one way.
Uniqueness is not enforced,
i.e., duplicate key values are allowed. Since the rows are sorted by
the key, any duplicates in the key will appear consecutively
The telephone directory was helpful in understanding what a key is, but it seems that a key is no different from a factor column. Furthermore, it does not explain why a key is needed (especially to use certain functions) or how to choose the column to set as key. Also, it seems that in a data.table with time as a column, setting any other column as key would probably mess up the time column too, which makes it even more confusing, as I do not know whether I am allowed to set any other column as key. Can someone enlighten me please?
In addition to this answer, please refer to the vignettes Secondary indices and auto indexing and Keys and fast binary search based subset as well.
This issue highlights the other vignettes that we plan to.
I've updated this answer again (Feb 2016) in light of the new on= feature that allows ad-hoc joins as well. See history for earlier (outdated) answers.
What exactly does setkey(DT, a, b) do?
It does two things:
reorders the rows of the data.table DT by the column(s) provided (a, b) by reference, always in increasing order.
marks those columns as key columns by setting an attribute called sorted to DT.
The reordering is both fast (due to data.table's internal radix sorting) and memory efficient (only one extra column of type double is allocated).
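As a rough mental model of those two steps, here is a toy Python sketch. It is not data.table's implementation (data.table uses its own radix sort and C code; this uses Python's built-in sort), and the class KeyedTable and its methods are hypothetical names. It shows the rows being reordered in place, the key being recorded on the object, and how sortedness then allows a binary search instead of a full scan.

```python
from bisect import bisect_left, bisect_right


class KeyedTable:
    """Toy stand-in for a keyed table: rows kept sorted by one column."""

    def __init__(self, rows):
        self.rows = list(rows)  # list of dicts, one dict per row
        self.key = None         # plays the role of the 'sorted' attribute
        self._keys = []

    def setkey(self, column):
        # Step 1: reorder the rows in place, in increasing order
        self.rows.sort(key=lambda r: r[column])
        # Step 2: mark the column as the key
        self.key = column
        self._keys = [r[column] for r in self.rows]

    def lookup(self, value):
        """Binary search for all rows whose key equals value."""
        if self.key is None:
            raise ValueError("no key set")
        lo = bisect_left(self._keys, value)
        hi = bisect_right(self._keys, value)
        return self.rows[lo:hi]


dt = KeyedTable([
    {"x": 3, "y": "c"},
    {"x": 1, "y": "a"},
    {"x": 3, "y": "d"},
    {"x": 2, "y": "b"},
])
dt.setkey("x")
matches = dt.lookup(3)
print(matches)
```

Because duplicates end up next to each other after sorting, the lookup returns a contiguous slice, which is the same property that makes groups contiguous in RAM for keyed grouping.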
When is setkey() required?
For grouping operations, setkey() was never an absolute requirement. That is, we can perform a cold-by or adhoc-by.
## "cold" by
require(data.table)
DT <- data.table(x=rep(1:5, each=2), y=1:10)
DT[, mean(y), by=x] # no key is set, order of groups preserved in result
However, prior to v1.9.6, joins of the form x[i] required key to be set on x. With the new on= argument from v1.9.6+, this is not true anymore, and setting keys is therefore not an absolute requirement here as well.
## joins using < v1.9.6
setkey(X, a) # absolutely required
setkey(Y, a) # not absolutely required as long as 'a' is the first column
X[Y]
## joins using v1.9.6+
X[Y, on="a"]
# or if the column names are x_a and y_a respectively
X[Y, on=c("x_a" = "y_a")]
Note that on= argument can be explicitly specified even for keyed joins as well.
The only operation that requires key to be absolutely set is the foverlaps() function. But we are working on some more features which when done would remove this requirement.
So what's the reason for implementing on= argument?
There are quite a few reasons.
It makes it clear that the operation involves two data.tables. Just doing X[Y] does not make this as obvious, although naming the variables appropriately can help.
It also lets you see which columns the join/subset is performed on just by looking at that line of code (without having to trace back to the corresponding setkey() line).
In operations where columns are added or updated by reference, on= operations are much more performant as it doesn't need the entire data.table to be reordered just to add/update column(s). For example,
## compare
setkey(X, a, b) # why physically reorder X to just add/update a column?
X[Y, col := i.val]
## to
X[Y, col := i.val, on=c("a", "b")]
In the second case, we did not have to reorder. It's not computing the order that's time consuming, but physically reordering the data.table in RAM, and by avoiding it, we retain the original order, and it is also performant.
Even otherwise, unless you're performing joins repetitively, there should be no noticeable performance difference between keyed and ad-hoc joins.
This leads to the question, what advantage does keying a data.table have anymore?
Is there an advantage to keying a data.table?
Keying a data.table physically reorders it based on those column(s) in RAM. Computing the order is not usually the time consuming part, rather the reordering itself. However, once we've the data sorted in RAM, the rows belonging to the same group are all contiguous in RAM, and is therefore very cache efficient. It's the sortedness that speeds up operations on keyed data.tables.
It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.
In most cases therefore, there shouldn't be a need to set keys anymore. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.
Question: What do you think the performance would be like, compared to a keyed join, if you use setorder() to reorder the data.table and then use on=? If you've followed thus far, you should be able to figure it out :-).
A key is basically an index into a dataset, which allows for very fast and efficient sort, filter, and join operations. These are probably the best reasons to use data tables instead of data frames (the syntax for using data tables is also much more user friendly, but that has nothing to do with keys).
If you don't understand indexes, consider this: a phone book is "indexed" by name. So if I want to look up someone's phone number, it's pretty straightforward. But suppose I want to search by phone number (e.g., look up who has a particular phone number)? Unless I can "re-index" the phone book by phone number, it will take a very long time.
Consider the following example: suppose I have a table, ZIP, of all the zip codes in the US (>33,000) along with associated information (city, state, population, median income, etc.). If I want to look up the information for a specific zip code, the search (filter) is about 1000 times faster if I setkey(ZIP, zipcode) first.
Another benefit has to do with joins. Suppose I have a list of people and their zip codes in a data table (call it "PPL"), and I want to append information from the ZIP table (e.g. city, state, and so on). The following code will do it:
setkey(ZIP, zipcode)
setkey(PPL, zipcode)
full.info <- PPL[ZIP, nomatch = FALSE]
This is a "join" in the sense that I'm joining the information from 2 tables based on a common field (zipcode). Joins like this on very large tables are extremely slow with data frames, and extremely fast with data tables. In a real-life example I had to do more than 20,000 joins like this on a full table of zip codes. With data tables the script took about 20 min. to run. I didn't even try it with data frames because it would have taken more than 2 weeks.
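To see why sorting both tables on the join column helps, here is a generic sort-merge inner join in Python. It is meant only as an illustration of the idea and is not how data.table implements keyed joins. The function merge_join and the sample rows are made up for the example.

```python
def merge_join(left, right, key):
    """Inner join of two lists of dicts by walking both in key order."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1          # left key has no partner yet, move on
        elif lk > rk:
            j += 1          # right key has no partner yet, move on
        else:
            # Find the run of equal keys on the right side
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**r, **left[i]})
                i += 1
            j = j_end
    return out


# Made-up sample data
ppl = [
    {"name": "Ann", "zipcode": "10001"},
    {"name": "Bob", "zipcode": "99999"},
    {"name": "Cy", "zipcode": "02139"},
]
zips = [
    {"zipcode": "02139", "city": "Cambridge"},
    {"zipcode": "10001", "city": "New York"},
    {"zipcode": "60601", "city": "Chicago"},
]
full_info = merge_join(ppl, zips, "zipcode")
print(full_info)
```

Once both inputs are sorted, each table is walked only once, so the matching step is linear in the table sizes rather than requiring a scan of one table for every row of the other.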
IMHO you should not just read but study the FAQ and Intro material. It's easier to grasp if you have an actual problem to apply this to.
[Response to @Frank's comment]
Re: sorting vs. indexing - Based on the answer to this question, it appears that setkey(...) does in fact rearrange the rows of the table (i.e., a physical sort), and does not create an index in the database sense. This has some practical implications: for one thing, if you set the key in a table with setkey(...) and then change any of the values in the key column, data.table merely declares the table to be no longer sorted (by turning off the sorted attribute); it does not dynamically re-index to maintain the proper sort order (as would happen in a database). Also, "removing the key" using setkey(DT, NULL) does not restore the table to its original, unsorted order.
Re: filter vs. join - the practical difference is that filtering extracts a subset from a single dataset, whereas join combines data from two datasets based on a common field. There are many different kinds of join (inner, outer, left). The example above is an inner join (only records with keys common to both tables are returned), and this does have many similarities to filtering.