Finding similarity between Arabic text - plsql

I'm currently working on a production Oracle 12c database. One of its tables contains about 2 million records and has a name column that holds both Arabic and English text. I'm trying to find a way to analyze the name column and return all rows whose name is similar to a given name. I tried the utl_match package, which contains implementations of edit_distance and jaro_winkler, but it doesn't work well for Arabic text: Arabic has several near-identical letters, such as (أ, ا, إ), that the algorithms treat as completely different letters, which gives poor results. So what I'm looking for now is a way to normalize the Arabic text so I can use it with the utl_match package, or any alternative that can help me do the job.
The task should be done in PL/SQL, but if that's impossible I'm open to using any other tool or idea.

Use the TRANSLATE function before UTL_MATCH.
For example, the initial edit distance is 2:
select
utl_match.edit_distance
(
s1 => text1,
s2 => text2
) edit_distance
from
(
select
'ليونيكود أاإ' text1,
'ليونيكود ااا' text2
from dual
);
After manually translating similar characters into the exact same character, the edit distance is now 0:
select
utl_match.edit_distance
(
s1 => translate(text1, 'أإ', 'اا'),
s2 => translate(text2, 'أإ', 'اا')
) edit_distance
from
(
select
'ليونيكود أاإ' text1,
'ليونيكود ااا' text2
from dual
);
There might be a better, more official way to compare strings using NLS settings and tools, but if there are only a few characters causing an issue, it's simpler to use TRANSLATE.
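If the comparison ever has to run outside the database, the same normalize-then-compare idea can be sketched in Python. This is only an illustration under assumptions: the names `ALEF_VARIANTS`, `normalize_arabic` and `edit_distance` are made up for this sketch, the Levenshtein function is a plain textbook version rather than Oracle's UTL_MATCH implementation, and the mapping for U+0622 (alef with madda) is an extra that the TRANSLATE call above does not include.

```python
# Map alef variants to bare alef (U+0627). The first two pairs mirror the
# TRANSLATE call above; U+0622 is an added assumption for illustration.
ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda -> alef (assumption)
})


def normalize_arabic(text: str) -> str:
    """Replace the alef variants with plain alef, one character at a time."""
    return text.translate(ALEF_VARIANTS)


def edit_distance(a: str, b: str) -> int:
    """Textbook Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


# Three alef forms versus three plain alefs, as in the example strings above.
sample_a = "\u0623\u0627\u0625"
sample_b = "\u0627\u0627\u0627"

distance_before = edit_distance(sample_a, sample_b)
distance_after = edit_distance(normalize_arabic(sample_a),
                               normalize_arabic(sample_b))
```

The point of the design is that normalization happens once, before any similarity metric runs, so the metric itself does not need to know anything about Arabic.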

Related

Query first letters of each words and full text search in one query

Suppose I have such data:
Lena
Lera
Elena
Mark
Allen
Paul
When the user enters 'le', I need to:
Find the words that start with those letters.
Then follow them with all the other words that contain the characters 'le' (full text search).
So it should return:
Lena
Lera
Allen
Elena
Something like that (this example does not return anything):
SELECT name FROM table WHERE name LIKE 'le%' AND like '%le%' ORDER BY name ASC
Thank you.
Since it turns out that UNION mixes up the order of the subselects, let's try something else. You could use a custom ORDER BY: first order by whether the name starts with your search term, then by name.
SELECT name
FROM table
WHERE UPPER(name) LIKE UPPER('%le%')
ORDER BY (CASE WHEN UPPER(name) LIKE UPPER('le%') THEN 1 ELSE 2 END),
name ASC
Comparing the strings with UPPER makes the match case-insensitive. However, UPPER apparently only supports ASCII; read this article for more information on the topic. It also covers other ways to ignore case when comparing strings.
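To see the ordering in action, here is a small runnable sketch that assumes SQLite (through Python's sqlite3 module), since the question does not name the database. The table name `people` and the helper `search` are hypothetical; `table` from the query above is a reserved word and cannot be used unquoted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [(n,) for n in ["Lena", "Lera", "Elena", "Mark", "Allen", "Paul"]],
)


def search(term):
    """Names containing `term`, with prefix matches listed first."""
    rows = conn.execute(
        """
        SELECT name
        FROM people
        WHERE UPPER(name) LIKE UPPER('%' || ? || '%')
        ORDER BY CASE WHEN UPPER(name) LIKE UPPER(? || '%') THEN 1 ELSE 2 END,
                 name ASC
        """,
        (term, term),
    ).fetchall()
    return [row[0] for row in rows]


results = search("le")
print(results)
```

Binding the search term as a parameter instead of pasting it into the SQL text also avoids quoting problems when the term comes from user input.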

Strings Comparing between result set and correct set

I'm working on an algorithm to extract keywords from a text. I have a test set of scientific abstracts with their tags (keywords). My question is: what is the best way to compare the correct tags with the tags my algorithm produces?
Should I compare them strictly, e.g.
if (correct_tag == result_tag)
...or do a similarity check? I ask because sometimes I get something like the following:
For the same document:
**correct_tag** = ["eigenvalues and eigenfunctions in quantum mechanics"]
**result_tag** = ["eigenvalues and eigenfunctions"]
For Another Document:
**correct_tag** = ["cardiovascular system"]
**result_tag** = ["cardiovascular physiology", "cardiovascular system"]
NOTE: These are in-text tags, meaning they are extracted from the text.
Any help is appreciated, thanks.
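For reference, the two options mentioned in the question can be put side by side in a few lines of Python. This is only a sketch: token-level Jaccard overlap is one possible similarity check among many, it is not suggested anywhere in the question, and the function names are made up.

```python
def exact_match(correct_tag: str, result_tag: str) -> bool:
    """Strict comparison: the tags must be identical strings."""
    return correct_tag == result_tag


def token_jaccard(correct_tag: str, result_tag: str) -> float:
    """Share of distinct words the two tags have in common (0.0 to 1.0)."""
    a = set(correct_tag.lower().split())
    b = set(result_tag.lower().split())
    if not a and not b:
        return 1.0  # two empty tags are treated as identical
    return len(a & b) / len(a | b)


correct = "eigenvalues and eigenfunctions in quantum mechanics"
result = "eigenvalues and eigenfunctions"

strict = exact_match(correct, result)     # the strict test rejects the pair
overlap = token_jaccard(correct, result)  # the overlap score gives partial credit
```

With a score like this, a threshold decides what counts as a hit, and that threshold would have to be chosen against the test set.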

How can I prevent SQLite from treating a string as a number?

I would like to query an SQLite table that contains directory paths to find all the paths under some hierarchy. Here's an example of the contents of the column:
/alpha/papa/
/alpha/papa/tango/
/alpha/quebec/
/bravo/papa/
/bravo/papa/uniform/
/charlie/quebec/tango/
If I search for everything under /bravo/papa/, I would like to get:
/bravo/papa/
/bravo/papa/uniform/
I am currently trying to do this like so (see below for the long story of why I can't use more simple methods):
SELECT * FROM Files WHERE Path >= '/bravo/papa/' AND Path < '/bravo/papa0';
This works. It looks a bit weird, but it works for this example. '0' is the Unicode code point 1 greater than '/'. When ordered lexicographically, all the paths starting with '/bravo/papa/' compare greater than or equal to it and less than '/bravo/papa0'. However, in my tests, I find that this breaks down when we try this:
SELECT * FROM Files WHERE Path >= '/' AND Path < '0';
This returns no results, but it should return every row. As far as I can tell, the problem is that SQLite is treating '0' as a number, not a string. If I use '0Z' instead of '0', for example, I do get results, but I introduce a risk of getting false positives. (For example, if there actually was an entry '0'.)
The simple version of my question is: is there some way to get SQLite to treat '0' in such a query as the length-1 string containing the Unicode character '0' (which should sort after strings such as '!', '*' and '/', but before '1', '=' and 'A') instead of the integer 0 (which SQLite sorts before all strings)?
I think in this case I can actually get away with special-casing a search for everything under '/', since all my entries will always start with '/', but I'd really like to know how to avoid this sort of thing in general, as it's unpleasantly surprising in all the same ways as Javascript's "==" operator.
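The range trick described above can be sketched with Python's sqlite3 module. The helper names `prefix_upper_bound` and `find_under` are hypothetical, and the sketch assumes a non-empty prefix and a column declared as TEXT; under that assumption the search under '/' returns every row.

```python
import sqlite3


def prefix_upper_bound(prefix: str) -> str:
    """Smallest string greater than every string starting with `prefix`.

    Assumes a non-empty prefix whose last character is not the highest
    code point.
    """
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Files (Path TEXT)")  # TEXT affinity is assumed
conn.executemany(
    "INSERT INTO Files (Path) VALUES (?)",
    [(p,) for p in [
        "/alpha/papa/", "/alpha/papa/tango/", "/alpha/quebec/",
        "/bravo/papa/", "/bravo/papa/uniform/", "/charlie/quebec/tango/",
    ]],
)


def find_under(prefix):
    """All paths that start with `prefix`, using a plain range comparison."""
    rows = conn.execute(
        "SELECT Path FROM Files WHERE Path >= ? AND Path < ? ORDER BY Path",
        (prefix, prefix_upper_bound(prefix)),
    ).fetchall()
    return [row[0] for row in rows]


under_bravo = find_under("/bravo/papa/")
under_root = find_under("/")
```

Because the query uses only `>=` and `<`, no LIKE or GLOB escaping is involved, which is the property the range form was chosen for.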
First approach
A more natural approach would be to use the LIKE or GLOB operator. For example:
SELECT * FROM Files WHERE Path LIKE @prefix || '%';
But I want to support all valid path characters, so I would need to use ESCAPE for the '_' and '%' symbols. Apparently this prevents SQLite from using an index on Path. (See http://www.sqlite.org/optoverview.html#like_opt ) I really want to be able to benefit from an index here, and it sounds like that's impossible using either LIKE or GLOB unless I can guarantee that none of their special characters will occur in the directory name, and POSIX allows anything other than NUL and '/', even GLOB's '*' and '?' characters.
I'm providing this for context. I'm interested in other approaches to solve the underlying problem, but I'd prefer to accept an answer that directly addresses the ambiguity of strings-that-look-like-numbers in SQLite.
Similar questions
How do I prevent sqlite from evaluating a string as a math expression?
In that question, the values weren't quoted. I get these results even when the values are quoted or passed in as parameters.
EDIT - See my answer below. The column was created with the invalid type "STRING", which SQLite treated as NUMERIC.
* Groan *. The column had NUMERIC affinity because it had accidentally been specified as "STRING" instead of "TEXT". Since SQLite didn't recognize the type name, it made it NUMERIC, and because SQLite doesn't enforce column types, everything else worked as expected, except that any time a number-like string is inserted into that column it is converted into a numeric type.
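The effect is easy to reproduce with Python's sqlite3 module. In this sketch the table names `files_string` and `files_text` and the two helper functions are made up for the demonstration; the only difference between the tables is the declared column type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files_string (path STRING)")  # unrecognized type name
conn.execute("CREATE TABLE files_text (path TEXT)")      # proper TEXT affinity

for table in ("files_string", "files_text"):
    conn.executemany(
        f"INSERT INTO {table} (path) VALUES (?)",
        [("/alpha/",), ("/bravo/",), ("0",)],
    )


def stored_types(table):
    """Storage class SQLite actually used for each inserted value."""
    rows = conn.execute(f"SELECT typeof(path) FROM {table} ORDER BY rowid")
    return [row[0] for row in rows]


def count_in_range(table):
    """Rows matched by the range query from the question."""
    return conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE path >= '/' AND path < '0'"
    ).fetchone()[0]


print(stored_types("files_string"), count_in_range("files_string"))
print(stored_types("files_text"), count_in_range("files_text"))
```

In the STRING table the value '0' is stored as an integer and the literal '0' in the comparison is converted the same way, so no path can compare less than it. Declaring the column as TEXT keeps both sides as strings.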

pl/pgsql merging or combining arrays

How to merge/combine arrays in pl/pgsql?
For example, I have 3 arrays: {1,2,3}, {"a","b","c"}, and {32,43,23}
After merging I need to get:
{{1,"a",32}, {2,"b",43}, {3,"c",23}}
My version of PostgreSQL is 9.0
It sounds like you want an n-argument zip function, as found in some functional languages and languages with functional extensions.
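For readers who have not met zip before, this is what it does in Python, shown purely as an illustration of the operation being asked for (it is not PostgreSQL code). The variable names are arbitrary.

```python
ids = [1, 2, 3]
letters = ["a", "b", "c"]
numbers = [32, 43, 23]

# zip pairs up the i-th element of every input list.
rows = list(zip(ids, letters, numbers))

# Casting every value to text gives a uniform two-dimensional structure,
# similar in spirit to a text[][] array.
rows_as_text = [[str(value) for value in row] for row in rows]

print(rows)
print(rows_as_text)
```

Python tuples can mix integers and strings freely, which is exactly the freedom a typed array does not have.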
In this case you can't do exactly what you want, because those arrays are of heterogeneous types. PostgreSQL arrays must be of a homogeneous type, so this won't work. The desired result you show is an invalid array.
You could create an array of ROWs (anonymous records), or cast all the values to text.
For example:
SELECT array_agg(ROW(a,b,c))
FROM (
SELECT
unnest('{1,2,3}'::integer[]),
unnest('{"a","b","c"}'::text[]),
unnest('{32,43,23}'::integer[])
)
x(a,b,c);
will produce:
{"(1,a,32)","(2,b,43)","(3,c,23)"}
which is an array of three rowtypes cast to text. It will be awkward to work with because Pg has only very limited support for anonymous records; most importantly in this case you cannot cast a text value to RECORD(integer,text,integer), you must actually CREATE TYPE and cast to the defined type.
Because of that limitation, you may instead want to cast all the values to text and use a two-dimensional array of text. You'd expect to be able to do that with a simple array_agg, but frustratingly this fails:
SELECT array_agg(ARRAY[a::text,b,c::text])
FROM (
SELECT
unnest('{1,2,3}'::integer[]),
unnest('{"a","b","c"}'::text[]),
unnest('{32,43,23}'::integer[])
)
x(a,b,c);
producing:
ERROR: could not find array type for data type text[]
because array_agg doesn't support arrays as input. You need to define another variant of array_agg that takes a text[] input. I wrote one a while ago but can't find it now; I'll try to locate it and update if I find it. In the meantime you can work around it by casting the inner array to text:
SELECT array_agg(ARRAY[a::text,b,c::text]::text)
FROM (
SELECT
unnest('{1,2,3}'::integer[]),
unnest('{"a","b","c"}'::text[]),
unnest('{32,43,23}'::integer[])
)
x(a,b,c);
producing output like:
{"{1,a,32}","{2,b,43}","{3,c,23}"}
... OK, I haven't found the one I wrote, but here's an example from Erwin that does the job fine. Try this:
CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
,STYPE = anyarray
,INITCOND = '{}'
);
SELECT array_agg_mult(ARRAY[ARRAY[a::text,b,c::text]])
FROM (
SELECT
unnest('{1,2,3}'::integer[]),
unnest('{"a","b","c"}'::text[]),
unnest('{32,43,23}'::integer[])
)
x(a,b,c);
Output:
{{1,a,32},{2,b,43},{3,c,23}}

SQLite X'...' notation with column data

I am trying to write a custom report in Spiceworks, which uses SQLite queries. This report will fetch me hard drive serial numbers that are unfortunately stored in a few different ways depending on what version of Windows and WMI were on the machine.
Three common examples (which are enough to get to the actual question) are as follows:
Actual serial number: 5VG95AZF
Hexadecimal string with leading spaces: 2020202057202d44585730354341543934383433
Hexadecimal string with leading zeroes: 3030303030303030313131343330423137454342
The two hex strings are further complicated in that even after they are converted to their ASCII representation, the two characters in each pair are actually swapped. Here is an example:
3030303030303030313131343330423137454342 evaluates to 00000000111430B17ECB
However, the actual serial number on that hard drive is 1141031BE7BC, without leading zeroes and with the bytes swapped around. According to other questions and answers I have read on this site, this has to do with the "endianness" of the data.
My temporary query so far looks something like this (shortened to only the pertinent section):
SELECT pd.model as HDModel,
CASE
WHEN pd.serial like "30303030%" THEN
cast(('X''' || pd.serial || '''') as TEXT)
WHEN pd.serial like "202020%" THEN
LTRIM(X'2020202057202d44585730354341543934383433')
ELSE
pd.serial
END as HDSerial
The result of that query is something like this:
HDModel HDSerial
----------------- -------------------------------------------
Normal Serial 5VG95AZF
202020% test case W -DXW05CAT94843
303030% test case X'3030303030303030313131343330423137454342'
This shows that the X'....' notation style does convert into the correct (but backwards) result of W -DXW05CAT94843 when given a fully literal number (the 202020% line). However, I need to find a way to do the same thing to the actual data in the column, pd.serial, and I can't find a way.
My initial thought was that if I could build a string representation of the X'...' notation, then perhaps cast() would evaluate it. But as you can see, that just ends up spitting out X'3030303030303030313131343330423137454342' instead of the expected 00000000111430B17ECB. This means the concatenation is working correctly, but I can't find a way to evaluate it as hex the same way as in the manual test case.
I have been googling all morning to see if there is just some syntax I am missing, but the closest I have come is this concatenation using the || operator.
EDIT: Ultimately I just want to be able to have a simple case statement in my query like this:
SELECT pd.model as HDModel,
CASE
WHEN pd.serial like "30303030%" THEN
LTRIM(X'pd.serial')
WHEN pd.serial like "202020%" THEN
LTRIM(X'pd.serial')
ELSE
pd.serial
END as HDSerial
But because pd.serial gets wrapped in single quotes, it is taken as a literal string instead of taken as the data contained in that column. My hope was/is that there is just a character or operator I need to specify, like X'$pd.serial' or something.
END EDIT
If I can get past this first hurdle, my next task will be to try and remove the leading zeroes (the way LTRIM eats the leading spaces) and reverse the bytes, but to be honest, I would be content even if that part isn't possible because it wouldn't be hard to post-process this report in Excel to do that.
If anyone can point me in the right direction I would greatly appreciate it! It would obviously be much easier if I was using PHP or something else to do this processing, but because I am trying to have it be an embedded report in Spiceworks, I have to do this all in a single SQLite query.
X'...' is the notation for binary (BLOB) literals in SQLite. If the values are strings, you can just use them as such.
This should be a start:
sqlite> select X'3030303030303030313131343330423137454342';
00000000111430B17ECB
sqlite> select ltrim(X'3030303030303030313131343330423137454342','0');
111430B17ECB
I hope this puts you on the right path.
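If post-processing the report outside SQLite turns out to be acceptable (the question mentions Excel as a fallback), the whole conversion is short in Python. This is a sketch under assumptions, not a single-query solution: the function name `decode_serial` is made up, the prefix checks mirror the CASE branches in the question, and stripping leading zeroes assumes that no real serial number starts with '0'.

```python
def decode_serial(serial: str) -> str:
    """Turn a hex-encoded, pair-swapped serial into the printed serial.

    Values without one of the known hex prefixes are returned unchanged.
    """
    if not serial.startswith(("30303030", "202020")):
        return serial
    # Hex digits -> raw bytes -> ASCII text.
    text = bytes.fromhex(serial).decode("ascii")
    # Swap the two characters in every pair to undo the byte order.
    swapped = "".join(text[i:i + 2][::-1] for i in range(0, len(text), 2))
    # Drop the padding (spaces or zeroes) on the left.
    return swapped.lstrip(" 0")


plain = decode_serial("5VG95AZF")
zero_padded = decode_serial("3030303030303030313131343330423137454342")
space_padded = decode_serial("2020202057202d44585730354341543934383433")
```

The swap is done before the padding is stripped so that the pairs stay aligned with the original byte boundaries.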
