Denormalize table and append - sqlite

How do I denormalize this table
+---------+--------------------------------------+
| lemma   | definition                           |
+---------+--------------------------------------+
| abandon | a feeling of emotional intensity     |
| abandon | forsake                              |
| abandon | leave behind                         |
| abbey   | a church associated with a monastery |
| abbey   | a convent ruled by an abbess         |
+---------+--------------------------------------+
as
+---------+-------------------------------------------------------------------+
| lemma   | definition                                                        |
+---------+-------------------------------------------------------------------+
| abandon | a feeling of emotional intensity forsake leave behind             |
| abbey   | a church associated with a monastery a convent ruled by an abbess |
+---------+-------------------------------------------------------------------+
What should be the query?

SELECT lemma, GROUP_CONCAT(definition, ' ') AS definition
FROM your_table
GROUP BY lemma;
GROUP_CONCAT is one of the aggregate functions available in SQLite.

Try this:
SELECT lemma, GROUP_CONCAT(definition,' ')
FROM your_table
GROUP BY lemma
It works in SQLite. As the documentation for the GROUP_CONCAT function says:
The group_concat() function returns a string which is the
concatenation of all non-NULL values of X. If parameter Y is present
then it is used as the separator between instances of X. A comma (",")
is used as the separator if Y is omitted. The order of the
concatenated elements is arbitrary.
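Since the documentation warns that the concatenation order is arbitrary, here is a minimal end-to-end sketch using Python's built-in sqlite3 module. The pre-sorting subquery is my addition, not part of the answers above; it usually controls the order in practice, though SQLite does not guarantee it:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE your_table (lemma TEXT, definition TEXT);
    INSERT INTO your_table VALUES
        ('abandon', 'a feeling of emotional intensity'),
        ('abandon', 'forsake'),
        ('abandon', 'leave behind'),
        ('abbey',   'a church associated with a monastery'),
        ('abbey',   'a convent ruled by an abbess');
""")

# The subquery pre-sorts the rows; SQLite does not guarantee that
# GROUP_CONCAT preserves this order, but in practice it usually does.
for lemma, definition in conn.execute("""
    SELECT lemma, GROUP_CONCAT(definition, ' ') AS definition
    FROM (SELECT * FROM your_table ORDER BY lemma, definition)
    GROUP BY lemma
"""):
    print(lemma, '->', definition)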

Related

DynamoDB intersection select with pagination

I have the following DB schema and I'd like to find the best way to select the list of sort keys that are common to PK_A and PK_B:
+---------------+---------+
| PK            | SortKey |
+---------------+---------+
|               | SK_A    |
| PK_A          | SK_B    |
|               | SK_C    |
| - - - - - - - | - - - - |
|               | SK_B    |
| PK_B          | SK_C    |
|               | SK_D    |
+---------------+---------+
So when I select by PK_A and PK_B, it should return only SK_B and SK_C?
Any help is appreciated.
Simple answer: you can't do it (in one call).
DynamoDB is not a relational database; operations such as intersection are not supported.
You'd need to query() once for each partition key and then compute the intersection yourself.
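A minimal sketch of that approach in Python with boto3; the table name is hypothetical, the key attribute names PK and SortKey follow the question's schema, and pagination is handled via LastEvaluatedKey:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my_table")  # hypothetical table name

def sort_keys(pk):
    """Collect all SortKey values for one partition, following pagination."""
    keys = set()
    kwargs = {"KeyConditionExpression": Key("PK").eq(pk)}
    while True:
        page = table.query(**kwargs)
        keys.update(item["SortKey"] for item in page["Items"])
        if "LastEvaluatedKey" not in page:
            return keys
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# DynamoDB cannot intersect server-side, so compute it client-side.
common = sort_keys("PK_A") & sort_keys("PK_B")
print(common)  # {'SK_B', 'SK_C'}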

SQLite Versioning. Is it possible to use EXCEPT to show differences between rows where only one column changes?

I'm quite new to SQLite and I'm trying to use an EXCEPT statement to compare two tables with very similar data. The data comes from a CSV file I download daily; new rows are added, old ones are deleted, and existing rows can have one or more columns change each day. I'm trying to find a way to select rows in which any column's data has changed, since I cannot predict which column will change.
Say for example I have:
TABLE contracts:
|ID|Description|Name|Contract Type|
|1 |Plumbing |Bob |Paper |
|2 |Cooking |Ryan|Paper |
|3 |Driving |Eric|Paper |
|4 |Dancing |Emma|Paper |
and:
TABLE updated_contracts:
|ID|Description|Name|Contract Type|
|1 |Hiking |Bob |Paper |
|2 |Cooking |Ryan|Paper |
|3 |Driving |Eric|Paper |
|4 |Dancing |Emma|Digital |
I'd like it to return:
|1 |Hiking |Bob |Paper |
|4 |Dancing |Emma|Digital |
because contract 1 has changed the description and contract 4 has changed the contract type.
Is it possible to do this in SQLite?
One way to do it is with a LEFT join of updated_contracts to contracts where the matching rows are filtered out:
select uc.*
from updated_contracts uc left join contracts c
using(id, Description, Name, `Contract Type`)
where c.id is null
EXCEPT can also be used like this:
select * from updated_contracts
except
select * from contracts
This will work only if the tables have the same number of columns. Its advantage is that it treats NULLs as equal: if a column is NULL in both rows, EXCEPT considers the values to match, which the LEFT join solution does not.
Results:
| ID | Description | Name | Contract Type |
| --- | ----------- | ---- | ------------- |
| 1 | Hiking | Bob | Paper |
| 4 | Dancing | Emma | Digital |
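To see it run, here is a minimal sqlite3 sketch in Python; the data is copied from the question, and "Contract Type" is quoted with double quotes, which is standard SQL:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contracts (ID INTEGER, Description TEXT, Name TEXT, "Contract Type" TEXT);
    CREATE TABLE updated_contracts (ID INTEGER, Description TEXT, Name TEXT, "Contract Type" TEXT);
    INSERT INTO contracts VALUES
        (1, 'Plumbing', 'Bob', 'Paper'), (2, 'Cooking', 'Ryan', 'Paper'),
        (3, 'Driving', 'Eric', 'Paper'), (4, 'Dancing', 'Emma', 'Paper');
    INSERT INTO updated_contracts VALUES
        (1, 'Hiking', 'Bob', 'Paper'), (2, 'Cooking', 'Ryan', 'Paper'),
        (3, 'Driving', 'Eric', 'Paper'), (4, 'Dancing', 'Emma', 'Digital');
""")

# Rows in updated_contracts that have no exact match in contracts.
for row in conn.execute("SELECT * FROM updated_contracts EXCEPT SELECT * FROM contracts"):
    print(row)  # (1, 'Hiking', 'Bob', 'Paper') and (4, 'Dancing', 'Emma', 'Digital')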

Get weight of words by occurrence

Maybe this belongs on math.stackexchange, but I am afraid I would get a formula as an answer that I won't understand.
I have products in our database, and products from different suppliers in another table.
What I want is to pair these suppliers' products with our products where possible, or at least get a list where the match quality is high.
I iterated through all the supplier products, split each product name on spaces, and stored every word in a table together with its occurrence count.
The table looks like this:
+--------+-------------+---------------+-------+
| id | word | originalWord | count |
+--------+-------------+---------------+-------+
| 220950 | Tracer | Tracer | 493 |
| 220951 | Destroyer | Destroyer | 3 |
| 220952 | Avago5050 | Avago5050 | 4 |
| 220953 | mouse | mouse | 2535 |
| 220954 | TRAMYS44916 | /TRAMYS44916/ | 2 |
| 220955 | GameZone | GameZone | 16 |
| 220956 | Enduro | Enduro | 3 |
| 220957 | AVAGO | AVAGO | 10 |
| 220958 | 5050 | 5050 | 4 |
| 220959 | optical | optical | 2370 |
| 220960 | USB | USB | 6160 |
+--------+-------------+---------------+-------+
and so on. Of course, in another table I stored which product ID each word belongs to.
So what I want is to determine the weight of a word by its occurrence count.
As you can see, the word TRAMYS44916 occurs only twice; it is almost certainly a part number, so it is the heaviest word. Its weight should be 1.
The most frequent word is USB with 6160 occurrences, so its weight should be something like 0.01, I think.
What is the best way to get all the weights of the words?
There are other tables for other suppliers, so the distribution always changes.
This reminds me of Naive Bayes text classification. To determine which product a name belongs to, you can calculate the tf-idf of all the words.
Then, to pair another product name, you can decompose it into words again and select the product ID with the highest term value; you should probably set some threshold, though, because in some cases the match will not be clear-cut.
tf-idf = ("number of word matches in product name" / "word count of product name") * log("number of products" / "number of products that contain the word")
You can see how it is done in the example here (In your case the document will be the product full name): https://en.wikipedia.org/wiki/Tf–idf#Example_of_tf.E2.80.93idf
Example implementation in Java: https://guendouz.wordpress.com/2015/02/17/implementation-of-tf-idf-in-java/
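A minimal sketch of the idf-style weighting in Python; the word counts come from the question's table, the total product count is hypothetical, and normalizing so the rarest word gets weight 1 is my assumption rather than something the answer specifies:

import math

# word -> number of product names containing it, from the question's table
counts = {"TRAMYS44916": 2, "AVAGO": 10, "GameZone": 16, "mouse": 2535, "USB": 6160}
n_products = 50000  # hypothetical total number of product names

# Inverse document frequency: rare words score high, common words score low.
idf = {w: math.log(n_products / c) for w, c in counts.items()}

# Normalize so the rarest word gets weight 1.0 (an assumption for readability).
top = max(idf.values())
weights = {w: v / top for w, v in idf.items()}

print(weights["TRAMYS44916"])        # 1.0
print(round(weights["USB"], 3))      # much smaller, ~0.21 under these assumptions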

SQLite3: dynamic between query

I have this sqlite3 table (simplified):
+--------+----------+-------+
| ROUTE | WPNumber | WPID |
+--------+----------+-------+
| A123 | 1 | WP001 |
| A123 | 2 | WP002 |
| A123 | 3 | WP003 |
| [...] | [...] | [...] |
| A123 | 20 | WP020 |
+--------+----------+-------+
Let's say I want to travel this route in the reverse direction (020 to 001).
How do I get all the WPIDs in between? I know it is possible to build a query using BETWEEN and DESC, but then I'd have to build two separate queries and have Python check when to use which one. Is it possible to have sqlite3 do the work, independent of the direction (reverse or not)?
You can reverse the sorting order by reversing the number used in the ORDER BY clause.
Set the parameter ? to either 1 or -1:
SELECT WPID
FROM ThisTable
WHERE ROUTE = 'A123'
ORDER BY WPNumber * ?
Note that a plain ORDER BY WPNumber ASC or DESC would give the database a better opportunity to optimize the sorting with an index; the computed expression WPNumber * ? cannot use one.
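For completeness, a small sketch of how the direction flag might be bound from Python's sqlite3; the table and column names follow the answer, and the database filename is hypothetical:

import sqlite3

conn = sqlite3.connect("routes.db")  # hypothetical database file

def waypoints(route, reverse=False):
    # ORDER BY WPNumber * 1 ascends; ORDER BY WPNumber * -1 descends.
    direction = -1 if reverse else 1
    return [wpid for (wpid,) in conn.execute(
        "SELECT WPID FROM ThisTable WHERE ROUTE = ? ORDER BY WPNumber * ?",
        (route, direction))]

print(waypoints("A123", reverse=True))  # ['WP020', ..., 'WP001']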

Creating a unique integer on the basis of a string

I have a larger dataset (a data.table with approx 9m rows) with a column that I would like to use to aggregate values (min, max, etc.). The column is a combination of various other columns and has a string-based format, like the one below:
string <- "318XXXX | VNSGN | BIER"
To gain some speed in performing tasks, I would like to recode this to a unique integer. Another application that I use regularly to deal with data has a built-in function that transforms a string like the one above into an integer (e.g. 73823). I was wondering whether there is a similar function in R? The idea is that a particular string always results in the same integer; this would allow it to be used for merging data.tables, etc.
Here is a little example of the data.table column that I would like to encode as simple integer values:
sample <- c("318XXXX | VNSGN | BIER", "462XXXX | TZZZH | 9905", "462XXXX | TZZZH | 9905",
"462XXXX | TZZZH | 9905", "511XXXX | FAWOR | 336H", "511XXXX | FAWOR | 336H",
"652XXXX | XXXXR | T136", "652XXXX | XXXXR | T136", "672XXXX | BQQSZ | 7777",
"672XXXX | BQQSZ | 7777")
I am hoping to encode the strings into an additional column of the table, like the one below; note that identical strings result in identical numbers.
String                   Number
318XXXX | VNSGN | BIER   19872
462XXXX | TZZZH | 9905   78392
462XXXX | TZZZH | 9905   78392
462XXXX | TZZZH | 9905   78392
511XXXX | FAWOR | 336H   23053
511XXXX | FAWOR | 336H   23053
652XXXX | XXXXR | T136   95832
652XXXX | XXXXR | T136   95832
672XXXX | BQQSZ | 7777   71829
672XXXX | BQQSZ | 7777   71829
The data.table package will create indexes for you without making you handle them explicitly so it would be less work than the approach in the question. See the setkey function in data.table.
Also the sqldf package can use the SQL create index statement as per Examples 4h and 4i on the sqldf home page as can just about any database package.
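The question asks about R, but the underlying idea, enumerating distinct strings so that identical strings always map to the same small integer (essentially what R's factor() levels or a data.table key do internally), is easy to sketch; here it is in Python for concreteness:

# A minimal sketch of enumeration-based encoding: identical strings
# always receive the same integer within one dataset.
sample = ["318XXXX | VNSGN | BIER", "462XXXX | TZZZH | 9905",
          "462XXXX | TZZZH | 9905", "511XXXX | FAWOR | 336H"]

codes, table = [], {}
for s in sample:
    # setdefault assigns the next integer the first time a string is seen
    codes.append(table.setdefault(s, len(table) + 1))

print(codes)  # [1, 2, 2, 3]

Note that enumeration-based codes are only stable within one dataset; a hash-based scheme would be order-independent across files, at the cost of a small collision risk.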
