Is leftsemi faster with the smaller table on the left side? (Azure Data Explorer)

The join operator documentation says:
Tip
For best performance, if one table is always smaller than the
other, use it as the left (piped) side of the join.
The purpose of leftsemi in most cases is to filter a bigger set on the left by a smaller set on the right. Is the quote above still applicable to the leftsemi flavor of the join operator?

At least at this point, table order does matter.
Here are the results of a quick test, executed on my dev cluster:
Setup
.set-or-replace L100M <| range i from 1 to 100000000 step 1
.set-or-replace S1M <| range i from 1 to 1000000 step 1 | project i = tolong(rand(100000000))
rightsemi (small table first)
S1M | join kind=rightsemi L100M on i | consume
The query completes in around 3 seconds.
leftsemi (large table first)
L100M | join kind=leftsemi S1M on i | consume
The query runs for about 20 seconds and then fails with the following exception:
Query execution lacks memory resources to complete (80DA0007): Partial
query failure: Low memory condition (E_LOW_MEMORY_CONDITION).
(message: 'bad allocation', details: '').
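Building on the rightsemi workaround above, a broadcast join may speed things up further when the left side is small, since the small table is distributed to all nodes instead of shuffling the large one; a sketch (hint.strategy=broadcast is documented KQL, but whether it actually helps depends on cluster resources and data sizes):
S1M | join hint.strategy=broadcast kind=rightsemi L100M on i | consume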


Sqoop trying to --split-by ROWID (Oracle) fails

(Be kind, this is my first question, and I did extensive research here and on the net beforehand. The question Oracle ROWID for Sqoop Split-By Column did not really solve this issue, as the original asker resorted to using another column.)
I am using sqoop to copy data from an Oracle 11 DB.
Unfortunately, some tables have no index and no primary key, only partitions (by date). These tables are very large: hundreds of millions, if not billions, of rows.
So far, I have decided to access the data in the source by explicitly addressing the partitions. That works well and speeds up the process nicely.
I need to do the splits by data that resides in each and every table, in order to avoid too many if-branches in my bash script. (We're talking some 200+ tables here.)
I noticed that a split across 8 tasks results in a very uneven spread of workload among the tasks. I considered using Oracle ROWID to define the split.
To do this, I must define a boundary-query. In a standard query 'select * from xyz', the ROWID is not part of the result set; therefore, it is not an option to let Sqoop derive the boundary-query from --query.
Now, when I run this, I get the error:
ERROR tool.ImportTool: Encountered IOException running import job:
java.io.IOException: Sqoop does not have the splitter for the given SQL
data type. Please use either different split column (argument --split-by)
or lower the number of mappers to 1. Unknown SQL data type: -8
Samples of ROWID:
AAJXFWAKPAAOqqKAAA
AAJXFWAKPAAOqqKAA+
AAJXFWAKPAAOqqKAA/
A ROWID is static and unique once it is created for a row.
I cast this funny datatype into something else in my boundary-query:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:oracle:thin:@127.0.0.1:port:mydb --username $USER -P -m 8 \
  --split-by ROWID \
  --boundary-query "select cast(min(ROWID) as varchar(18)), cast(max(ROWID) as varchar(18)) from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')" \
  --query "select * from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD') and \$CONDITIONS" \
  --null-string '\\N' \
  --null-non-string '\\N'
But then I get ugly ROWIDs that are rejected by Oracle:
select * from table where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')
and ( ROWID >= 'AAJX6oAG聕聁AE聉N:' ) AND ( ROWID < 'AAJX6oAH⁖⁁AD䁔䀷' ) ,
Error Msg = ORA-01410: invalid ROWID
How can I resolve this properly?
I am a Linux embryo and have painfully chewed myself through the topics of bash shell scripting and Sqooping so far, but I would like to make better use of an evenly spread mapper-task workload - it would cut sqoop time in half, I guess, saving some 5 to 8 hours.
TIA!
wahlium
You can try ROWNUM, but I think sqoop import does not work with pseudocolumns.
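If ROWNUM doesn't work as a split column, a workaround sketch (untested against this setup; the split_key alias, the 8-bucket count, and the target dir are my own choices, while ORA_HASH and MOD are standard Oracle functions) is to derive an integer bucket from ROWID inside a subquery and split on that:
sqoop import \
  --connect jdbc:oracle:thin:@127.0.0.1:port:mydb \
  --username $USER -P -m 8 \
  --split-by SPLIT_KEY \
  --boundary-query "select 0, 7 from dual" \
  --query "select * from (select t.*, mod(ora_hash(t.rowid), 8) as split_key from table t where laufbzdt > TO_DATE('2019-02-27', 'YYYY-MM-DD')) where \$CONDITIONS" \
  --target-dir /tmp/table_import
Because ora_hash spreads ROWIDs roughly uniformly over the 8 buckets, the 8 mappers should each receive a similar row count, which is exactly the even workload you are after.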

Google Sheets FILTER() and QUERY() not working with SUM()

I'm trying to pull and sum data from one sheet on another. This is GA data being built into a report, so I have sessions split up by landing page and device type, and would like to group them in different ways.
I usually use FILTER() for this sort of thing, but it keeps returning a 0 sum. Thinking this may be an odd edge case with FILTER(), I switched to using QUERY() instead. That gave me an error, but a Google search doesn't offer much documentation about what the error actually means. Taking a guess that it could be indicating an issue with the data type (i.e. not numeric), I changed the format of the source from "Automatic" to "Number", but to no avail.
Maybe it's a lack of coffee, but I'm at a loss as to why neither function is working to do a simple lookup and sum by criteria.
FILTER() function
SUM(FILTER(AllData!C:C,AllData!A:A="/chestnut/",AllData!B:B="desktop"))
No error, but returns 0 regardless of filter parameters.
QUERY() function
QUERY(AllData!A:G, "SELECT SUM(C) WHERE A='/chestnut/' AND B='desktop'",1)
Error returned:
Unable to parse query string for Function QUERY parameter 2: AVG_SUM_ONLY_NUMERIC
Sample data:
landingPage | deviceCategory | sessions
------------|----------------|---------
/chestnut/  | desktop        | 4
/chestnut/  | desktop        | 2
/chestnut/  | tablet         | 5
/chestnut/  | tablet         | 1
/maple/     | desktop        | 1
/maple/     | desktop        | 2
/maple/     | mobile         | 3
/maple/     | mobile         | 1
I think the summing doesn't work because your numbers are text-formatted.
See if any of these work (change ranges to suit):
using FILTER()
=SUM(FILTER(VALUE(AllData!C:C),AllData!A:A="/chestnut/",AllData!B:B="desktop"))
using QUERY()
=ArrayFormula(QUERY({AllData!A:B, VALUE(AllData!C:C)}, "SELECT SUM(Col3) WHERE Col1='/chestnut/' AND Col2='desktop' label SUM(Col3)''",1))
using SUMPRODUCT()
=SUMPRODUCT(VALUE(AllData!C2:C),AllData!A2:A="/chestnut/",AllData!B2:B="desktop")
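As a quick sanity check for the text-format theory, this counts how many cells in the sessions column are stored as text (a sketch; adjust the range to suit):
=ARRAYFORMULA(SUM(N(ISTEXT(AllData!C2:C))))
A nonzero count confirms the values need wrapping in VALUE(), as in the formulas above, before SUM can add them.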

Need an appropriate query to find the result

I need the desired result with the least execution time.
I have a table which contains many rows (over 100k); in this table there is a field notes varchar2(1800).
It contains following values:
notes
CASE Transfer
Surnames AAA : BBBB
Case Status ACCOUNT TXFERRED TO BORROWERS
Completed Date 25/09/2022
Task Group 16
Message sent at 12/10/2012 11:11:21
Sender : lynxfailures123#google.com
Recipient : LFRB568767#yahoo.com
Received : 21:31 12/12/2002
Rows with the value ACCOUNT TXFERRED TO BORROWERS should be returned.
I have used the following queries, but they take a long time (72150436 sec) to execute:
Select * from cps_case_history where dbms_lob.instr(notes, 'ACCOUNT TFR TO UFSS') > 1
Select * from cps_case_history where notes like '%ACCOUNT TFR TO UFSS%'
Could you please share a query which will take less time to execute?
You can try parallel hints (see Optimizer hints):
Select /*+ PARALLEL(a,8) */ a.* from cps_case_history a
where INSTR(NOTES,'Text you want to search') > 0; -- your condition
Replace 8 with 16 and see if the performance improves further.
Avoid % at the beginning of the LIKE pattern, i.e., where notes like '%Account...', since a leading wildcard prevents the use of a normal index.
Updated answer: try creating partitioned tables. You can go with range partitioning on the completed_date column (see Partitioning).
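Since the slow part is the substring search itself, an Oracle Text index is often a more direct fix than parallelism or partitioning. A sketch (the index name is made up; CTXSYS.CONTEXT and CONTAINS are standard Oracle Text, and a CONTEXT index must be kept in sync after DML, here via SYNC (ON COMMIT)):
create index cps_notes_ctx on cps_case_history (notes)
  indextype is ctxsys.context parameters ('sync (on commit)');

select *
  from cps_case_history
 where contains(notes, 'ACCOUNT TFR TO UFSS') > 0;
Unlike LIKE '%...%' or INSTR, CONTAINS can answer the word search from the text index instead of scanning all 100k rows.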

Postgresql partitions: Abnormally high seq scan cost on master table

I have a little database of a few hundred million rows for storing call detail records. I set up partitioning as per:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
and it seemed to work pretty well until now. I have a master table "acmecdr" which has rules for inserting into the correct partition and check constraints to make sure the correct table is used when selecting data. Here is an example of one of the partitions:
cdrs=> \d acmecdr_20130811
Table "public.acmecdr_20130811"
Column | Type | Modifiers
-------------------------------+---------+------------------------------------------------------
acmecdr_id | bigint | not null default
...snip...
h323setuptime | bigint |
acmesipstatus | integer |
acctuniquesessionid | text |
customers_id | integer |
Indexes:
"acmecdr_20130811_acmesessionegressrealm_idx" btree (acmesessionegressrealm)
"acmecdr_20130811_acmesessioningressrealm_idx" btree (acmesessioningressrealm)
"acmecdr_20130811_calledstationid_idx" btree (calledstationid)
"acmecdr_20130811_callingstationid_idx" btree (callingstationid)
"acmecdr_20130811_h323setuptime_idx" btree (h323setuptime)
Check constraints:
"acmecdr_20130811_h323setuptime_check" CHECK (h323setuptime >= 1376179200 AND h323setuptime < 1376265600)
Inherits: acmecdr
Now, as one would expect with SET constraint_exclusion = on, the correct partition should automatically be preferred, and since there is an index on it there should be only one index scan.
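(Side note, mine: on 9.1 the default is already constraint_exclusion = partition, which applies the exclusion check to inheritance children; the session value can be verified with the query below.)
SHOW constraint_exclusion;
-- returns off, on, or partition; 'partition' is the 9.1 default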
However:
cdrs=> explain analyze select * from acmecdr where h323setuptime > 1376179210 and h323setuptime < 1376179400;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.00..1435884.93 rows=94 width=1130) (actual time=138857.660..138858.778 rows=112 loops=1)
-> Append (cost=0.00..1435884.93 rows=94 width=1130) (actual time=138857.628..138858.189 rows=112 loops=1)
-> Seq Scan on acmecdr (cost=0.00..1435863.60 rows=1 width=1137) (actual time=138857.584..138857.584 rows=0 loops=1)
Filter: ((h323setuptime > 1376179210) AND (h323setuptime < 1376179400))
-> Index Scan using acmecdr_20130811_h323setuptime_idx on acmecdr_20130811 acmecdr (cost=0.00..21.33 rows=93 width=1130) (actual time=0.037..0.283 rows=112 loops=1)
Index Cond: ((h323setuptime > 1376179210) AND (h323setuptime < 1376179400))
Total runtime: 138859.240 ms
(7 rows)
So, I can see it's not scanning all the partitions, only the relevant one (which is an index scan and pretty quick) and also the master table (which seems to be normal from the examples I've seen). But the high cost of the seq scan on the master table seems abnormal. I would love for that to come down, and I see no reason for it, especially since the master table does not have any records in it:
cdrs=> select count(*) from only acmecdr;
count
-------
0
(1 row)
Unless I'm missing something obvious, this query should be quick. But it's not: it takes about 2 minutes, which does not seem normal at all (even for a slow server).
I'm out of ideas of what to try next, so if anyone has any suggestions or pointers in the right direction, it would be very much appreciated.
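One diagnostic worth running (my suggestion, not part of the original thread): the planner prices a seq scan from relpages, the table's physical page count, while count(*) only sees live rows, so a master table that once held data can report 0 rows yet still occupy many pages on disk:
select relpages, reltuples,
       pg_size_pretty(pg_relation_size('acmecdr')) as on_disk
  from pg_class
 where relname = 'acmecdr';
If relpages is large despite zero live rows, the master table is bloated, and VACUUM FULL acmecdr; (or TRUNCATE ONLY acmecdr;) would shrink it and bring the cost estimate back down.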

Downloading the entire Bitcoin transaction chain with R

I'm pretty new here, so thank you in advance for the help. I'm trying to do some analysis of the entire Bitcoin transaction chain. In order to do that, I'm trying to create two tables:
1) A full list of all Bitcoin addresses and their balance, i.e.,:
| ID | Address | Balance |
-------------------------------
| 1 | 7d4kExk... | 32 |
| 2 | 9Eckjes... | 0 |
| . | ... | ... |
2) A record of the number of transactions that have ever occurred between any two addresses in the Bitcoin network
| ID | Sender | Receiver | Transactions |
--------------------------------------------------
| 1 | 7d4kExk... | klDk39D... | 2 |
| 2 | 9Eckjes... | 7d4kExk... | 3 |
| . | ... | ... | .. |
To do this I've written a (probably very inefficient) script in R that loops through every block and scrapes blockexplorer.com to compile the tables. I've tried running it a couple of times so far, but I'm running into two main issues:
1 - It's very slow... I can imagine it's going to take at least a week at the rate that it's going
2 - I haven't been able to run it for more than a day or two without it hanging. It seems to just freeze RStudio.
I'd really appreciate your help in two areas:
1 - Is there a better way to do this in R to make the code run significantly faster?
2 - Should I stop using R altogether for this and try a different approach?
Thanks in advance for the help! Please see below for the relevant chunks of code I'm using:
library(XML)  # provides readHTMLTable()

url_start <- "http://blockexplorer.com/b/"
url_end <- ""

readUrl <- function(url) {
  table <- try(readHTMLTable(url)[[1]])
  if (inherits(table, "try-error")) {
    message(paste("URL does not seem to exist:", url))
    errors <<- errors + 1     # <<- so the counter defined outside the function is updated
    return(NA)
  } else {
    processed <<- processed + 1
    return(table)
  }
}

block_loop <- function(end, start = 0) {
  ...
  addr_row <- 1   # starting row to fill out table
  links_row <- 1  # starting row to fill out table
  for (i in start:end) {
    print(paste0("Reading block: ", i))
    url <- paste(url_start, i, url_end, sep = "")
    table <- readUrl(url)
    if (length(table) == 1 && is.na(table)) { next }  # skip blocks that failed to load
    ....
There are very close to 250,000 blocks on the site you mentioned (at least, 260,000 gives a 404). Curling from my connection (1 MB/s down) gives an average speed of about half a second. Try it yourself from the command line (just copy and paste) to see what you get:
curl -s -w "%{time_total}\n" -o /dev/null http://blockexplorer.com/b/220000
I'll assume your requests are about as fast as mine. Half a second times 250,000 is 125,000 seconds, or a day and a half. This is the absolute best you can get using any methods because you have to request the page.
Now, after doing an install.packages("XML"), I saw that running readHTMLTable("http://blockexplorer.com/b/220000") takes about five seconds on average. Five seconds times 250,000 is 1.25 million seconds, which is about two weeks. So your estimates were correct; this is really, really slow. For reference, I'm running a 2011 MacBook Pro with a 2.2 GHz Intel Core i7 and 8 GB of memory (1333 MHz).
Next, table merges in R are quite slow. Assuming 100 records per block (which seems about average), you'll have 25 million rows, and some of these rows have a kilobyte of data in them. Even assuming you can fit this table in memory, concatenating tables will be a problem.
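If you stay in R, one standard mitigation for the merge cost (a sketch of mine, not from the original answer; n_blocks is a placeholder) is to collect the per-block tables in a pre-allocated list and bind them once at the end, instead of growing a data frame inside the loop:
tables <- vector("list", n_blocks)          # one slot per block
for (i in seq_len(n_blocks)) {
  tables[[i]] <- readUrl(paste0(url_start, i - 1))
}
ok <- !vapply(tables, function(x) length(x) == 1 && is.na(x), logical(1))
all_data <- do.call(rbind, tables[ok])      # one rbind instead of one per block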
The solution to these problems that I'm most familiar with is to use Python instead of R, BeautifulSoup4 instead of readHTMLTable, and pandas to replace R's data frame. BeautifulSoup is fast (install lxml, a parser written in C) and easy to use, and pandas is very quick too. Its DataFrame class is modeled after R's, so you can probably work with it just fine. If you need something to request URLs and return the HTML for BeautifulSoup to parse, I'd suggest Requests. It's lean and simple, and the documentation is good. All of these are pip installable.
If you still run into problems the only thing I can think of is to get maybe 1% of the data in memory at a time, statistically reduce it, and move on to the next 1%. If you're on a machine similar to mine, you might not have another option.
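For illustration, a minimal sketch of that stack (requests + BeautifulSoup + pandas); the URL pattern comes from the question, but the assumption that the first HTML table on the page is the one you want is mine and may need adjusting:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def read_block(i):
    """Fetch one block page and return its first HTML table as a DataFrame."""
    resp = requests.get("http://blockexplorer.com/b/%d" % i, timeout=30)
    if resp.status_code != 200:          # e.g. a 404 past the last block
        return None
    soup = BeautifulSoup(resp.text, "lxml")
    table = soup.find("table")           # assumption: first table holds the data
    if table is None:
        return None
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in table.find_all("tr")]
    return pd.DataFrame(rows[1:], columns=rows[0])

frames = [df for df in (read_block(i) for i in range(250000)) if df is not None]
all_blocks = pd.concat(frames, ignore_index=True)   # concatenate once at the end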
