Is there an efficient way to split an SQLite database into multiple files by one of its columns? For example, if I start with one database which has:
full.db:
event_id|type
--------+----
8896631 | a
8896632 | b
8896633 | c
8896634 | b
8896635 | a
8896636 | a
I want to end up with 3 new DB files:
a.db:
event_id|type
--------+----
8896631 | a
8896635 | a
8896636 | a
b.db:
event_id|type
--------+----
8896632 | b
8896634 | b
c.db:
event_id|type
--------+----
8896633 | c
A little more backstory: I have a 1TB sqlite3 file stored on a cluster's network filesystem, and each of some 40,000 different tasks across about 1,000 nodes is trying to access it. This is not scaling well, so I'm wondering if I can do a pre-processing step to break this giant sqlite file into smaller pieces so that each task only needs access to one of these pieces.
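For reference, here is the kind of pre-processing step I have in mind, as a minimal Python sketch using the sqlite3 module. It assumes the table is called events and has only the two columns shown above; the real schema, indexes, and any batching are left out.

# Hedged sketch: split full.db into one <type>.db file per distinct value of "type".
# Assumes a table named "events" with the columns (event_id, type) shown above.
import sqlite3

SRC = "full.db"

src = sqlite3.connect(SRC)
types = [row[0] for row in src.execute("SELECT DISTINCT type FROM events")]
src.close()

for t in types:
    dest = sqlite3.connect("{}.db".format(t))
    # ATTACH lets one connection read the source file and write the new file
    dest.execute("ATTACH DATABASE ? AS src", (SRC,))
    dest.execute("CREATE TABLE events (event_id INTEGER, type TEXT)")
    dest.execute(
        "INSERT INTO events SELECT event_id, type FROM src.events WHERE type = ?",
        (t,),
    )
    dest.commit()                       # commit before detaching the source
    dest.execute("DETACH DATABASE src")
    dest.close()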
Suppose I have a table and I run a query like the one below:
let data = orders | where zip == "11413" | project timestamp, name, amount ;
inject data into newOrdersInfoTable in another cluster // ==> How can I achieve this?
There are many ways to do it. If it is a manual task and does not involve too much data, you can simply do something like this in the target cluster:
.set-or-append TargetTable <| cluster("put here the source cluster url").database("put here the source database").orders
| where zip == "11413" | project timestamp, name, amount
Note that if the dataset is larger, you can use the "async" flavor of this command (.set-or-append async ...). If the data size is bigger still, you should consider exporting the data and then importing it into the other cluster.
We have a WordPress website using MariaDB where the wp_options table keeps growing because a rogue plugin writes thousands of records to it. The issue has not been resolved by the plugin maintainer yet, and I keep having to remove these 'transient' (temporary) records manually via DELETE statements. The problem is that the ibd file keeps growing and is now 35GB in size. Once this is resolved, I plan to run OPTIMIZE TABLE on the table to clean up. Is that the best approach to reclaim all that space? I assume I'll need as much as 40GB of free space to do this; how long should the OPTIMIZE TABLE take? Since this table is used quite a bit by WordPress, it seems best to take the website offline while optimizing to avoid locks. I'm looking for the quickest way to resolve this.
At least, I think these rogue records are the cause of the table growing. Below is a list of the top 10 types of entries in the table:
MariaDB [wmnf_www]> SELECT substr(`wp_options`.`option_name`, 1, 18) AS `option_name`, count(`wp_options`.`option_value`) AS `cnt` FROM `wp_options` GROUP BY substr(`wp_options`.`option_name`, 1, 18) ORDER BY `cnt` DESC LIMIT 10;
+--------------------+-------+
| option_name | cnt |
+--------------------+-------+
| _transient_timeout | 21186 |
| _transient_ee_ssn_ | 12628 |
| _transient_jpp_li_ | 222 |
| _transient_externa | 125 |
| _transient_wc_rela | 63 |
| jpsq_sync-14716436 | 50 |
| wpmf_current_folde | 35 |
| _wc_session_expire | 34 |
| jpsq_sync-14716465 | 29 |
| jpsq_sync-14716417 | 25 |
+--------------------+-------+
10 rows in set (0.17 sec)
The _transient_ee_ssn_ and _transient_timeout_ee_ entries are the issue and keep growing; they are the only ones in the set above that have grown since last night, and there were initially 800K of them. I keep removing the records, which the plugin maintainer said was safe. But is this the cause of the ibd file growing?
---UPDATE---
Oddly enough, the issue is not resolved and transient records keep getting generated by the thousands, but the ibd index file has stopped growing for the moment. After steadily growing over the weekend from 20GB to 39GB, it has not grown in a couple of hours. Perhaps there's a limit, or the file was growing for other reasons?
I think a better solution would be to recreate the table using Percona's pt-online-schema-change tool. It recreates the table, copies all the data into the new table, and then drops the old one. This avoids locking the table for a long time.
I come from a PHP/MySQL background, and JSON data and Firebase queries are still pretty new to me. Now I am working in React Native and I have a collection of data where one of the keys stores an integer. I want to get the average of all the integers. Where do I even start?
I have used Firebase's snapshot.numChildren function before, so I am getting a little more familiar with this JSON world, but any sort of help would be appreciated. Thanks!
So, you know that you can return all of the data and determine the average. I'm guessing this is for a large set of data where it would be ideal not to return the entire node every time you would like to retrieve and average.
It depends on what this data is and how it's being updated, but I think one option is to simply have a separate node that is updated every time the collection is added to or is changed.
Here is some really rough pseudo code. For example, if your database looks like this:
database
|
+--collection
| |
| +--item_one (probably a uid like -k2jduwi5j5j5)
| | |
| | +--number: 90
| |
| +--item_two
| | |
| | +--number: 70
|
+--collection_metadata
| |
| +--average: 80
| |
| +--number_of_items: 2
Then when a new item is added, you run a metadata calculation:
var numerator = average * number_of_items + newItem.number;
number_of_items++; // this is your new number of items
average = numerator / number_of_items; // this is your new average
Then when an item is updated, you run a metadata calculation:
var numerator = average * number_of_items - changedItem.oldNumber + changedItem.newNumber;
average = numerator / number_of_items; // this is your new average (number_of_items is unchanged)
Now when you want this data, you always have this data on hand.
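If it helps to see the arithmetic outside of pseudocode, here is the same bookkeeping as a small Python sketch (Python only to keep it concrete; how you persist average and number_of_items back to collection_metadata is up to your Firebase client):

# Rough sketch of the running-average bookkeeping described above.
# "average" and "count" would live under collection_metadata in Firebase.

def add_item(average, count, new_number):
    """Recompute the average when an item is added to the collection."""
    numerator = average * count + new_number
    count += 1                       # new number_of_items
    return numerator / count, count  # new average, new count

def change_item(average, count, old_number, new_number):
    """Recompute the average when an existing item's number changes."""
    numerator = average * count - old_number + new_number
    return numerator / count         # count is unchanged

# Example with the values from the tree above: two items, 90 and 70
average, count = 80.0, 2
average, count = add_item(average, count, 50)   # -> 70.0, 3
average = change_item(average, count, 50, 80)   # -> 80.0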
I'm working on a solution for Cassandra that's proving impossible.
We have a table that will return a set of candidates given some search criteria. The row with the highest score is returned to the user. We can do this quite easily with SQL, but there's a need to migrate to Cassandra. Here are the tables involved:
Value
ID | VALUE | COUNTRY | STATE | CITY | COUNTY
--------+---------+----------+----------+-----------+-----------
1 | 50 | US | | |
--------+---------+----------+----------+-----------+-----------
2 | 25 | | TX | |
--------+---------+----------+----------+-----------+-----------
3 | 15 | | | MEMPHIS |
--------+---------+----------+----------+-----------+-----------
4 | 5 | | | | BROWARD
--------+---------+----------+----------+-----------+-----------
5 | 30 | | NY | NYC |
--------+---------+----------+----------+-----------+-----------
6 | 20 | US | | NASHVILLE |
--------+---------+----------+----------+-----------+-----------
Scoring
ATTRIBUTE | SCORE
-------------+-------------
COUNTRY | 1
STATE | 2
CITY | 4
COUNTY | 8
A query is sent that can have any of those four attributes populated or not. We search through our values table, calculate the scores, and return the highest one. If a column in the values table is null, it means it's applicable for all.
ID 1 is applicable for all states, cities, and counties within the US.
ID 2 is applicable for all countries, cities, and counties where the state is TX.
Example:
Query: {Country: US, State: TX}
Matches Value IDs: [1, 2, 3, 4, 6]
Scores: [1, 2, 4, 8, 5(1+4)]
Result: {id: 4} (8 was the highest score so Broward returns)
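To make the rule concrete, here is the matching and scoring logic as a small Python sketch (this is just how the current SQL behaves as I understand it, not a Cassandra model: a row is excluded only when one of its non-null attributes conflicts with an attribute supplied in the query, and its score is the sum of the weights of its non-null attributes):

# Weights from the Scoring table
WEIGHTS = {"COUNTRY": 1, "STATE": 2, "CITY": 4, "COUNTY": 8}

# Rows from the Value table (None = applicable for all)
VALUES = [
    {"ID": 1, "COUNTRY": "US", "STATE": None, "CITY": None,        "COUNTY": None},
    {"ID": 2, "COUNTRY": None, "STATE": "TX", "CITY": None,        "COUNTY": None},
    {"ID": 3, "COUNTRY": None, "STATE": None, "CITY": "MEMPHIS",   "COUNTY": None},
    {"ID": 4, "COUNTRY": None, "STATE": None, "CITY": None,        "COUNTY": "BROWARD"},
    {"ID": 5, "COUNTRY": None, "STATE": "NY", "CITY": "NYC",       "COUNTY": None},
    {"ID": 6, "COUNTRY": "US", "STATE": None, "CITY": "NASHVILLE", "COUNTY": None},
]

def best_match(query):
    best = None
    for row in VALUES:
        score = 0
        conflict = False
        for attr, weight in WEIGHTS.items():
            if row[attr] is None:
                continue                        # null applies to everything
            if attr in query and query[attr] != row[attr]:
                conflict = True                 # e.g. STATE NY vs. query TX
                break
            score += weight
        if not conflict and (best is None or score > best[1]):
            best = (row, score)
    return best

row, score = best_match({"COUNTRY": "US", "STATE": "TX"})
print(row["ID"], score)   # -> 4 8 (BROWARD wins with the highest score)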
How would you model something like this in Cassandra 2.1?
I found that the best way to achieve this was using Solr with Cassandra.
Some things to note, though, about using Solr, since all the resources I needed were scattered across the internet.
You must first start Cassandra with Solr. There's a command with the dse tool for starting Cassandra with Solr enabled:
$CASSANDRA_HOME/bin/dse cassandra -s
You must create your keyspace with the network topology strategy and the Solr data center included:
CREATE KEYSPACE ... WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'Solr': 1}
Once you create your table within your Solr-enabled keyspace, create a core using dsetool:
$CASSANDRA_HOME/bin/dsetool create_core keyspace.table_name generateResources=true reindex=true
This will allow Solr to index your data and generate a number of secondary indexes against your Cassandra table.
Performing the queries needed for columns where values may or may not exist requires a somewhat complex query:
SELECT * FROM keyspace.table_name WHERE solr_query = '{"q": "(-column:[* TO *] AND *:*) OR column:value"}';
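If you're running that from application code rather than cqlsh, a minimal sketch with the DataStax Python driver could look like this (host, keyspace, table, and column names are placeholders):

# Minimal sketch: run a DSE Search (solr_query) statement from Python.
# Assumes the DataStax Python driver (pip install cassandra-driver) and a
# DSE node with Search enabled at 127.0.0.1; names below are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

solr_query = '{"q": "(-column:[* TO *] AND *:*) OR column:value"}'
rows = session.execute(
    "SELECT * FROM keyspace.table_name WHERE solr_query = %s",
    (solr_query,),
)
for row in rows:
    print(row)

cluster.shutdown()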
Finally, you may notice that when searching for text, your Solr query column:"Hello" may pick up other unwanted values like HelloWorld or HelloThere. This is due to the datatype used in your schema.xml for Solr. Here's how to modify this behavior:
Head to your Solr Admin UI. (Normally http://hostname:8983/solr/)
Choose your core in the drop-down list in the left pane; it should be named keyspace.table_name.
Look for Config or Schema; both should take you to the schema.xml.
Copy and paste that file into some text editor. Optionally, you could try using wget or curl to download the file, but you need the real link, which is provided in the text field at the top right.
There's a <fieldType> tag with the name TextField. Replace org.apache.solr.schema.TextField with org.apache.solr.schema.StrField. You must also remove the analyzers; StrField does not support them.
That's it, hopefully I've saved people from all the headaches I encountered.
I'm pretty new here, so thank you in advance for the help. I'm trying to do some analysis of the entire Bitcoin transaction chain. In order to do that, I'm trying to create 2 tables:
1) A full list of all Bitcoin addresses and their balance, e.g.:
| ID | Address | Balance |
-------------------------------
| 1 | 7d4kExk... | 32 |
| 2 | 9Eckjes... | 0 |
| . | ... | ... |
2) A record of the number of transactions that have ever occurred between any two addresses in the Bitcoin network:
| ID | Sender | Receiver | Transactions |
--------------------------------------------------
| 1 | 7d4kExk... | klDk39D... | 2 |
| 2 | 9Eckjes... | 7d4kExk... | 3 |
| . | ... | ... | .. |
To do this, I've written a (probably very inefficient) script in R that loops through every block and scrapes blockexplorer.com to compile the tables. I've tried running it a couple of times so far, but I'm running into two main issues:
1 - It's very slow... I can imagine it's going to take at least a week at the rate that it's going
2 - I haven't been able to run it for more than a day or two without it hanging. It seems to just freeze RStudio.
I'd really appreciate your help in two areas:
1 - Is there a better way to do this in R to make the code run significantly faster?
2 - Should I stop using R altogether for this and try a different approach?
Thanks in advance for the help! Please see below for the relevant chunks of code I'm using:
library(XML)  # readHTMLTable() comes from the XML package

url_start <- "http://blockexplorer.com/b/"
url_end <- ""

readUrl <- function(url) {
  table <- try(readHTMLTable(url)[[1]])
  if (inherits(table, "try-error")) {
    message(paste("URL does not seem to exist:", url))
    errors <<- errors + 1      # use <<- so the counter defined outside the function is updated
    return(NA)
  } else {
    processed <<- processed + 1
    return(table)
  }
}
block_loop <- function(end, start = 0) {
  ...
  addr_row <- 1   # starting row to fill out table
  links_row <- 1  # starting row to fill out table
  for (i in start:end) {
    print(paste0("Reading block: ", i))
    url <- paste(url_start, i, url_end, sep = "")
    table <- readUrl(url)
    if (!is.data.frame(table)) { next }   # skip blocks that failed to load
....
There are very close to 250,000 blocks on the site you mentioned (at least, 260,000 gives a 404). Curling from my connection (1 MB/s down) gives an average time of about half a second per page. Try it yourself from the command line (just copy and paste) to see what you get:
curl -s -w "%{time_total}\n" -o /dev/null http://blockexplorer.com/b/220000
I'll assume your requests are about as fast as mine. Half a second times 250,000 is 125,000 seconds, or a day and a half. This is the absolute best you can get using any methods because you have to request the page.
Now, after doing an install.packages("XML"), I saw that running readHTMLTable("http://blockexplorer.com/b/220000") takes about five seconds on average. Five seconds times 250,000 is 1.25 million seconds, which is about two weeks. So your estimates were correct; this is really, really slow. For reference, I'm running a 2011 MacBook Pro with a 2.2 GHz Intel Core i7 and 8GB of memory (1333 MHz).
Next, table merges in R are quite slow. Assuming 100 records per block's table (which seems about average), you'll have 25 million rows, and some of these rows have a kilobyte of data in them. Even assuming you can fit this table in memory, concatenating tables will be a problem.
The solution to these problems that I'm most familiar with is to use Python instead of R, BeautifulSoup4 instead of readHTMLTable, and Pandas to replace R's dataframe. BeautifulSoup is fast (install lxml, a parser written in C) and easy to use, and Pandas is very quick too. Its dataframe class is modeled after R's, so you probably can work with it just fine. If you need something to request URLs and return the HTML for BeautifulSoup to parse, I'd suggest Requests. It's lean and simple, and the documentation is good. All of these are pip installable.
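Here is a rough sketch of what that could look like. The exact table layout on blockexplorer.com is an assumption here; adjust the parsing to whatever the real markup looks like.

# Rough sketch of the Requests + BeautifulSoup + Pandas approach suggested above.
import io

import pandas as pd
import requests
from bs4 import BeautifulSoup

def read_block(height):
    url = "http://blockexplorer.com/b/{}".format(height)
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:      # e.g. a 404 past the last block
        return None
    soup = BeautifulSoup(resp.text, "lxml")
    tables = soup.find_all("table")
    if not tables:
        return None
    # First HTML table on the page, as a DataFrame (like readHTMLTable(url)[[1]])
    return pd.read_html(io.StringIO(str(tables[0])))[0]

frames = []
for height in range(0, 1000):        # whatever block range you need
    df = read_block(height)
    if df is not None:
        frames.append(df)

# One concatenation at the end is far cheaper than growing a table row by row
all_blocks = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()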
If you still run into problems the only thing I can think of is to get maybe 1% of the data in memory at a time, statistically reduce it, and move on to the next 1%. If you're on a machine similar to mine, you might not have another option.