Better "delete rows from table" performance - oracle11g

I have an RDF graph in Oracle that has approximately 7,000,000 triples (rows).
I have a simple SELECT statement that gets old duplicates (triples) and deletes them from this RDF graph.
Now, let's say my SELECT returns 300 results. This gets computationally very expensive, since the DELETE does a full scan of the TEST_tpl table 300 times, and as I said, TEST_tpl has approximately 7,000,000 rows...
DELETE FROM TEST_tpl t
WHERE t.triple.get_subject() IN
(
  SELECT rdf$stc_sub FROM rdf_stage_table_TEST
  WHERE rdf$stc_pred LIKE '%DateTime%'
)
I am trying to find a way to create an Oracle procedure that would go through the table only once for multiple values...
Or maybe someone knows of a better way...

The way I solved this was by creating a function-based index on triple.get_subject():
CREATE INDEX "SEMANTIC"."TEST_tpl_SUB_IDX"
ON "SEMANTIC"."TEST_tpl" ("MDSYS"."SDO_RDF_TRIPLE_S"."GET_SUBJECT"("TRIPLE"))
This improved the performance tremendously.
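A minimal sketch of how one might confirm that the DELETE now uses the function-based index instead of a full table scan (table and column names as above; the exact plan depends on your data and statistics):

EXPLAIN PLAN FOR
DELETE FROM TEST_tpl t
WHERE t.triple.get_subject() IN
(
  SELECT rdf$stc_sub FROM rdf_stage_table_TEST
  WHERE rdf$stc_pred LIKE '%DateTime%'
);

-- the plan should show an index range scan on TEST_tpl_SUB_IDX
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);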
Thank you @Justin Cave and @Michael for your help.

Related

Snowflake, Python/Jupyter analysis

I am new to Snowflake, and I am running a query to get a couple of days' data - it returns more than 200 million rows and takes a few days. I tried running the same query in Jupyter - and the kernel restarts/dies before the query ends. Even if it got into Jupyter - I doubt I could analyze the data in any reasonable timeline (but maybe using dask?).
I am not really sure where to start - I am trying to check the data for missing values, and my first instinct was to use Jupyter - but I am lost at the moment.
My next idea is to stay within Snowflake - and check the columns there with case statements (e.g. sum(case when column_value = '' then 1 else 0 end) as number_missing_values).
Does anyone have any ideas/direction I could try - or know if I'm doing something wrong?
Thank you!
Not really the answer you are looking for, but:
sum(case when column_value = '' then 1 else 0 end) as number_missing_values
When you say missing values, this will only find values that are an empty string.
It can also be written in a simpler form as:
count_if(column_value = '') as number_missing_values
The database already knows how many rows are in a table, and it knows how many NULL values there are in each column. If you are loading data into a table, it might make more sense not to load empty strings and to use NULL instead; then, for no compute cost, you can run:
count(*) - count(column) as number_empty_values
Also of note, if you have two tables in Snowflake, you can compare them via MINUS, aka:
select * from table_1
minus
select * from table_2
This is useful for finding missing rows; you do have to do it in both directions.
Then you can HASH rows, or hash the whole table via HASH_AGG.
But normally when looking for missing data, you have an external system, so the driver is 'what can that system handle' and finding common ground.
Also, in the past we were searching for bugs in our processing that caused duplicate data (where we needed/wanted no duplicates), so then the above, and COUNT DISTINCT-like commands, come in useful.
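To make the above concrete, here is a minimal sketch in Snowflake SQL (my_table, my_column, table_1, and table_2 are placeholder names):

-- empty strings vs. NULLs per column, entirely inside Snowflake
select
    count(*)                    as total_rows,
    count_if(my_column = '')    as empty_string_values,
    count(*) - count(my_column) as null_values
from my_table;

-- rows present in one table but not the other (run both directions)
select * from table_1
minus
select * from table_2;

select * from table_2
minus
select * from table_1;

-- quick whole-table comparison via an aggregate hash
select hash_agg(*) from table_1;
select hash_agg(*) from table_2;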

MariaDB update query taking a long time

I'm currently having some problems with our MySQL replication. We're using a master-master setup for failover purposes.
The replication itself is working, and I believe it is set up right. But we're having trouble with some queries that take an excruciating amount of time to execute.
Example:
| 166 | database | Connect | 35 | updating | update xx set xx = 'xx' where xx = 'xx' and xx = 'xx' | 0.000 |
These update queries sometimes take 20-30+ seconds to complete, and because of that the replication starts lagging behind; within a day it will be behind by a couple of hours. The strange part is that it will eventually catch up with the other master.
The table is around 100 million rows and about 70 GB in size. On the master where the queries are executed, they take less than a second.
Both configurations, MySQL and server, are nearly identical, and we have tried optimizing the table and the queries, but no luck so far.
Any recommendations we could try to solve this? Let me know if I can provide you with any more information.
Using:
MariaDB 10.1.35
CentOS 7.5.1804
The key aspect of this is how many rows you are updating:
If the percentage is low (less than 5% of the rows) then an index can help.
Otherwise, if you are updating a large number of rows (greater than 5%), a full table scan will be optimal. If you have millions of rows this will be slow. Maybe partitioning the table could help, but I would say you have little chance of improving it.
I'm going to assume you are updating a small percentage of rows, so you can use an index. Look at the condition in the WHERE clause. If it looks like this:
WHERE col1 = 'xx' and col2 = 'yy'
Then, an index on those columns will make your query faster. Specifically:
create index ix1 on my_table (col1, col2);
Depending on the selectivity of your columns, the flipped index could be faster:
create index ix2 on my_table (col2, col1);
You'll need to test which one is better for your specific case.
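A minimal sketch (my_table, col1, col2, and col3 are placeholder names) of how one might check that the optimizer actually picks up the new index; MariaDB supports EXPLAIN on UPDATE statements:

create index ix1 on my_table (col1, col2);

explain
update my_table
set    col3 = 'xx'
where  col1 = 'xx'
  and  col2 = 'yy';
-- the "key" column of the EXPLAIN output should show ix1 rather than NULL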

dplyr Filter Database Table with Large Number of Matches

I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (hence why I don't provide actual data here). This is clearly not efficient, since they will basically be put in an IN statement (it essentially hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other columns, you can use the between filter: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make queries for each batch, then combine the results. This may take a while, but should be stable and worth the wait.
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one DB query per value using purrr::map and then combine the results with dplyr::bind_rows() at the end. This way you'll have nice, piped code that will adapt if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.

HBase keyvalue (NOSQL) to Hive table (SQL)

I have some tables in Hive that I need to join together. Since I need to do some work on each of them - normalize the key, remove outliers... - and as I add more and more tables, this chaining process has turned into a big mess.
It is so easy to lose track of where you are, and the query is getting out of control.
However, I have a pretty clear idea of what the final table should look like, and each column is fairly independent of the other tables.
For example:
table_class1
name id score
Alex 1 90
Chad 3 50
...
table_class2
name id score
Alexandar 1 50
Benjamin 2 100
...
In the end I really want something that looks like:
name id class1 class2 ...
alex 1 90 50
ben 2 100 NA
chad 3 50 NA
I know it could be a left outer join, but I am really having a hard time creating a separate table for each of them after the normalization and then left outer joining each of them against the union of the keys...
I am thinking about using NoSQL (HBase) to dump the processed data into a key-value format, like:
(source, key, variable, value)
(table_class1, (alex, 1), class1, 90)
(table_class1, (chad, 3), class1, 50)
(table_class2, (alex, 1), class2, 50)
(table_class2, (benjamin, 2), class2, 100)
...
In the end, I want to use something like melt and cast from the R reshape package to bring the data back into a wide table.
This is a big data project, and there will be hundreds of millions of key-value pairs in HBase.
(1) I don't know if this is a legit approach.
(2) If so, is there any big data tool to pivot a long HBase table into a wide Hive table?
Honestly, I would love to help more, but I am not clear about what you're trying to achieve (maybe because I've never used R); please elaborate and I'll try to improve my answer if necessary.
What do you need HBase for? You can store your processed data in new tables and work with them; you can even CREATE VIEW to simplify the query if it's too large - maybe that's what you're looking for (Hive manual). Unless you have a good reason for using HBase, I'd stick to Hive to avoid additional complexity. Don't get me wrong, there are a lot of valid reasons for using HBase.
About your second question, you can define and use HBase tables as Hive tables; you can even CREATE them and INSERT ... SELECT into them, all inside Hive. Is that what you're looking for?: HBase/Hive integration doc
One last thing in case you don't know, you can create custom functions in HIVE very easily to help you with the tedious normalization process, take a look at this.
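A minimal sketch of the Hive-only route, using the example tables above and assuming the keys have already been normalized (wide_scores is a placeholder name); a FULL OUTER JOIN covers the "union of the keys" step implicitly:

CREATE VIEW wide_scores AS
SELECT
    COALESCE(c1.id, c2.id)     AS id,
    COALESCE(c1.name, c2.name) AS name,
    c1.score                   AS class1,
    c2.score                   AS class2
FROM table_class1 c1
FULL OUTER JOIN table_class2 c2
    ON c1.id = c2.id;

Each additional class table is one more FULL OUTER JOIN; if the view gets unwieldy, you can materialize intermediate steps with CREATE TABLE ... AS SELECT.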

RSQLite Faster Subsetting of large Table?

So I have a large dataset (see my previous question) that I need to subset based on an ID which I have in another table.
I use a statement like:
vars <- dbListFields(db, "UNIVERSE")
ids <- dbGetQuery(db, "SELECT ID FROM LIST1")
dbGetQuery(db,
paste("CREATE TABLE SUB1 (",
paste(vars,collapse=" int,"),
")"
) )
dbGetQuery(db,
paste("INSERT INTO SUB1 (",
paste(vars,collapse=","),
") SELECT * FROM UNIVERSE WHERE
UNIVERSE.ID IN (",
paste(t(ids),collapse=","),
")"
) )
The code runs (I may have missed a parenthesis above), but it takes a while since my table UNIVERSE is about 10 GB in size. The major problem is that I'm going to have to run this for many different tables "LIST#" to make "SUB#", and the subsets are not disjoint, so I can't just delete the records from UNIVERSE when I'm done with them.
I'm wondering if I've gone about subsetting the wrong way, or if there are other ways I can speed this up?
Thanks for the help.
This is kind of an old question and I don't know if you found the solution or not. If UNIVERSE.ID is a unique, non-NULL integer, setting it up as an 'INTEGER PRIMARY KEY' should speed things up a lot. There's some code and discussion here:
http://www.mail-archive.com/r-sig-db%40stat.math.ethz.ch/msg00363.html
I don't know if using an inner join would speed things up or not; it might be worth a try too.
Do you have an index on UNIVERSE.ID? I'm no SQLite guru, but generally you want fields that you are going to query on to have indexes.
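A minimal sketch, in plain SQLite SQL, of both suggestions (an index on the ID column plus a join instead of a long pasted IN list); table names follow the question, idx_universe_id is a placeholder name, and the statements could be sent from R with dbGetQuery as in the original code:

-- one-time: index the column used for filtering
CREATE INDEX IF NOT EXISTS idx_universe_id ON UNIVERSE (ID);

-- build the subset with a join rather than pasting the ids into the statement
CREATE TABLE SUB1 AS
SELECT u.*
FROM UNIVERSE u
INNER JOIN LIST1 l ON u.ID = l.ID;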
