I'm working on a solution for Cassandra that's proving impossible.
We have a table that will return a set of candidates given some search criteria. The row with the highest score is returned back to the user. We can do this quite easily with SQL, but there's a need to migrate to Cassandra. Here are the tables involved:
Value
ID | VALUE | COUNTRY | STATE | CITY | COUNTY
--------+---------+----------+----------+-----------+-----------
1 | 50 | US | | |
--------+---------+----------+----------+-----------+-----------
2 | 25 | | TX | |
--------+---------+----------+----------+-----------+-----------
3 | 15 | | | MEMPHIS |
--------+---------+----------+----------+-----------+-----------
4 | 5 | | | | BROWARD
--------+---------+----------+----------+-----------+-----------
5 | 30 | | NY | NYC |
--------+---------+----------+----------+-----------+-----------
6 | 20 | US | | NASHVILLE |
--------+---------+----------+----------+-----------+-----------
Scoring
ATTRIBUTE | SCORE
-------------+-------------
COUNTRY | 1
STATE | 2
CITY | 4
COUNTY | 8
A query is sent that can have any of those four attributes populated or not. We search through our values table, calculate the scores, and return the highest one. If a column in the values table is null, it means it's applicable for all.
ID 1 is applicable for all states, cities, and counties within the US.
ID 2 is applicable for all countries, cities, and counties where the state is TX.
Example:
Query: {Country: US, State: TX}
Matches Value IDs: [1, 2, 3, 4, 6]
Scores: [1, 2, 4, 8, 5(1+4)]
Result: {id: 4} (8 was the highest score so Broward returns)
How would you model something like this in Cassandra 2.1?
Found out the best way to achieve this was using Solr with Cassandra.
Somethings to note though about using Solr, since all the resources I needed were scattered amongst the internet.
You must first start Cassandra with Solr. There's a command with the dse tool for starting cassandra with Solr enabled.
$CASSANDRA_HOME/bin/dse cassandra -s
You must create your keyspace with network topology stategy and solr enabled.
CREATE KEYSPACE ... WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'Solr': 1}
Once you create your table within your solr enabled keyspace, create a core using the dsetool.
$CASSANDRA_HOME/bin/dsetool create_core keyspace.table_name generateResources=true reindex=true
This will allow solr to index your data and generate a number of secondary indexes against your cassandra table.
To perform the queries needed for columns where values may or may not exist requires a somewhat complex query.
SELECT * FROM keyspace.table_name WHERE solr_query = '{"q": "{(-column:[* TO *] AND *:*) OR column:value}"';
Finally, you may notice when searching for text, your solr query column:"Hello" may pick up other unwanted values like HelloWorld or HelloThere. This is due to the datatype used in your schema.xml for Solr. Here's how to modify this behavior:
Head to your Solr Admin UI. (Normally http://hostname:8983/solr/)
Choose your core in the drop down list in the left pane, should be named keyspace.table_name.
Look for Config or Schema, both should take you to the schema.xml.
Copy and paste that file to some text editor. Optionally, you could try using wget or curl to download the file, but you need the real link which is provided in the text field box to the top right.
There's a tag <fieldtype>, with the name TextField. Replace org.apache.solr.schema.TextField with org.apache.solr.schema.StrField. You must also remove the analyzers, StrField does not support those.
That's it, hopefully I've saved people from all the headaches I encountered.
Related
I am newer to KQL and I am trying to write a query against configuration changes made to files with an extension of ".config" and would like to remove results that are generated under the "TimeGenerated [UTC]" column. The results should exclude Thursday's from midnight- 2am EST. Would someone be able to assist me in writing this? Not sure how to write it up as to have it return the results that exclude the specific time frame. Below is what I have so far:
ConfigurationChange
| where ConfigChangeType in("Files")
| where FileSystemPath contains ".config"
| sort by TimeGenerated
| render table
Try adding the following filter, using dayofweek() and hourofday().
Note: The example below works in UTC. You can add the current offset of UTC -> EST to TimeGenerated as part of the filter
ConfigurationChange
| where dayofweek(TimeGenerated) != 6d and hourofday(TimeGenerated) !in(0, 1) // <---
| where ConfigChangeType in ("Files")
| where FileSystemPath contains ".config"
| sort by TimeGenerated
| render table
Is there an efficient way to split on sqlite Database into multiple files by one of their columns? So for example, if I start with one database which has:
full.db:
event_id|type
--------+----
8896631 | a
8896632 | b
8896633 | c
8896634 | b
8896635 | a
8896636 | a
I want to end up with 3 new DB files:
a.db:
event_id|type
--------+----
8896631 | a
8896635 | a
8896636 | a
b.db:
event_id|type
--------+----
8896632 | b
8896634 | b
c.db:
event_id|type
--------+----
8896633 | c
A little more backstory: I have a 1TB sqlite3 file stored on a cluster's network filesystem and each of some 40,000 different tasks across about 1000 nodes are trying to access it. This is not scaling well, so I'm wondering if I can do a pre-processing step to break this giant sqlite file into smaller pieces so that each task only needs access to one of these pieces.
We have a WordPress website using MariaDB where a wp_options table keeps growing due to a rogue plugin writing thousands of records to the table. The issue has not been resolved by the plugin maintainer yet and I keep having to remove these 'transient' (temp) records manually via DELETE statement. The problem is the ibd file keeps growing and now 35GB in size. Once this is resolved, I plan to do an OPTIMIZE TABLE on the table to cleanup. Is that the best approach to reclaim all that space? I assume I'll need as much as 40GB free space to do this and how long should the OPTIMIZE TABLE take? Since this table is used quite a bit by WordPress, it seems it will be best to take the website offline while optimizing to avoid locks. I'll looking for the quickest way to resolve.
At least I think these rogue records are the cause of the table growing. Below is a list of the top 10 type of entries in the table:
MariaDB [wmnf_www]> SELECT substr(`wp_options`.`option_name`, 1, 18) AS `option_name`, count(`wp_options`.`option_value`) AS `cnt` FROM `wp_options` GROUP BY substr(`wp_options`.`option_name`, 1, 18) ORDER BY `cnt` DESC LIMIT 10;
+--------------------+-------+
| option_name | cnt |
+--------------------+-------+
| _transient_timeout | 21186 |
| _transient_ee_ssn_ | 12628 |
| _transient_jpp_li_ | 222 |
| _transient_externa | 125 |
| _transient_wc_rela | 63 |
| jpsq_sync-14716436 | 50 |
| wpmf_current_folde | 35 |
| _wc_session_expire | 34 |
| jpsq_sync-14716465 | 29 |
| jpsq_sync-14716417 | 25 |
+--------------------+-------+
10 rows in set (0.17 sec)
The _transient_ee_ssn_ and _transient_timeout_ee_ are the issue and keep growing, the only ones in the set above that has grown since last night and was initially found with 800K records. I keep removing the records as the plugin maintainer said was safe. But is this the cause of the ibd file growing?
---UPDATE---
Oddly enough, the issue is not resolved and transient records keep getting generated by the thousands, but this ibd index file has stopped growing for the moment. After steadily growing over the weekend from 20GB to now 39GB, it has not grown in a couple of hours. Perhaps there's a limit or this file was growing for other reasons?
I think it would be a better solution to recreate the table using Percona pt-online-schema-change tool. This will recreate the table and move all the data to the new table then drop the old table. This will avoid locking the database for a long time.
I'm trying to pull and sum data from one sheet on another. This is GA data being built into a report, so I have sessions split up by landing page and device type, and would like to group them in different ways.
I usually use FILTER() for this sort of thing, but it keeps returning a 0 sum. Thinking this may be an odd edge case with FILTER(), I switched to using QUERY() instead. That gave me an error, but a Google search doesn't offer much documentation about what the error actually means. Taking a guess that it could be indicating an issue with the data type (i.e. not numeric), I changed the format of the source from "Automatic" to "Number", but to no avail.
Maybe it's a lack of coffee, I'm at a loss as to why neither function is working to do a simple lookup and sum by criteria.
FILTER() function
SUM(FILTER(AllData!C:C,AllData!A:A="/chestnut/",AllData!B:B="desktop"))
No error, but returns 0 regardless of filter parameters.
QUERY() function
QUERY(AllData!A:G, "SELECT SUM(C) WHERE A='/chestnut/' AND B='desktop'",1)
Error returned:
Unable to parse query string for Function QUERY parameter 2: AVG_SUM_ONLY_NUMERIC
Sample data:
landingPage | deviceCategory | sessions
-------------|----------------|----------
/chestnut/ | desktop | 4
/chestnut/ | desktop | 2
/chestnut/ | tablet | 5
/chestnut/ | tablet | 1
/maple/ | desktop | 1
/maple/ | desktop | 2
/maple/ | mobile | 3
/maple/ | mobile | 1
I think the summing doesn't work because your numbers are text formatted.
See if any of these work? (change ranges to suit)
using FILTER()
=SUM(FILTER(VALUE(AllData!C:C),AllData!A:A="/chestnut/",AllData!B:B="desktop"))
using QUERY()
=ArrayFormula(QUERY({AllData!A:B, VALUE(AllData!C:C)}, "SELECT SUM(Col3) WHERE Col1='/chestnut/' AND Col2='desktop' label SUM(Col3)''",1))
using SUMPRODUCT()
=SUMPRODUCT(VALUE(AllData!C2:C),AllData!A2:A="/chestnut/",AllData!B2:B="desktop")
I come from a php/mysql background, and json data and firebase queries are still pretty new to me. Now I am working in React Native and I have a collection of data, and one of the keys stores a integer. I want to get the average of all the integers. Where do I even start?
I have used Firebases snapshot.numChildren function before so I am getting a little more familiar in this json world, but any sort of help would be appreciated. Thanks!
So, you know that you can return all of the data and determine the average. I'm guessing this is for a large set of data where it would be ideal not to return the entire node every time you would like to retrieve and average.
It depends on what this data is and how it's being updated, but I think one option is to simply have a separate node that is updated every time the collection is added to or is changed.
Here is some really rough pseudo code. For example, if your database looks like this:
database
|
+--collection
| |
| +--item_one (probably a uid like -k2jduwi5j5j5)
| | |
| | +--number: 90
| |
| +--item_two
| | |
| | +--number: 70
|
+--collection_metadata
| |
| +--average: 80
| |
| +--number_of_items: 2
Then when a new item is added, you run a metadata calculation:
var numerator = average * number_of_items + newItem.number;
number_of_items++; <-- this is your new number of items
numerator / number_of_items; <-- this is your new average
Then when an item is updated, you run a metadata calculation:
var numerator = average * number_of_items - changedItem.oldNumber + changedItem.newNumber;
numerator / number_of_items; <-- this is your new average
Now when you want this data, you always have this data on hand.