MariaDB update query taking a long time

I'm currently having some problems with our MySQL replication. We're using a master-master setup for failover purposes.
The replication itself is working and I believe it's set up correctly, but we're having trouble with some queries that take an excruciating amount of time to execute.
Example:
| 166 | database | Connect | 35 | updating | update xx set xx = 'xx' where xx = 'xx' and xx = 'xx' | 0.000 |
These update queries sometimes take 20-30+ seconds to complete, so the replication starts lagging behind, and within a day it is a couple of hours behind. The strange part is that it does eventually catch up with the other master.
The table has around 100 million rows and is about 70 GB. On the master where the queries are originally executed, they take less than a second.
Both the MySQL and server configurations are nearly identical, and we have tried optimizing the table and the queries, but no luck so far.
Any recommendations we could try to solve this? Let me know if I can provide you with any more information.
Using:
MariaDB 10.1.35
CentOS 7.5.1804

The key aspect of this is how many rows you are updating:
If the percentage is low (less than 5% of the rows), then an index can help.
Otherwise, if you are updating a large number of rows (greater than 5%), a full table scan will be optimal. If you have millions of rows, this will be slow. Maybe partitioning the table could help, but I would say you have little chance of improving it.
I'm going to assume you are updating a small percentage of rows, so you can use an index. Look at the condition in the WHERE clause. If it looks like this:
WHERE col1 = 'xx' and col2 = 'yy'
Then, an index on those columns will make your query faster. Specifically:
create index ix1 on my_table (col1, col2);
Depending on the selectivity of your columns, the flipped index could be faster:
create index ix2 on my_table (col2, col1);
You'll need to test which one is better for your specific case.
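A quick way to check which index the optimizer actually picks is EXPLAIN (recent MariaDB versions should also accept EXPLAIN directly on the UPDATE itself); the table and column names below are just the placeholders from the examples above:
-- shows the chosen index and roughly how many rows are examined
EXPLAIN
SELECT *
FROM my_table
WHERE col1 = 'xx'
  AND col2 = 'yy';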

Related

Snowflake, Python/Jupyter analysis

I am new to Snowflake, and running a query to get a couple of days' worth of data. This returns more than 200 million rows and takes a few days. I tried running the same query in Jupyter, but the kernel restarts/dies before the query ends. Even if I got the data into Jupyter, I doubt I could analyze it in any reasonable timeline (but maybe using dask?).
I am not really sure where to start. I am trying to check the data for missing values, and my first instinct was to use Jupyter, but I am lost at the moment.
My next idea is to stay within Snowflake and check the columns there with case statements (e.g. sum(case when column_value = '' then 1 else 0 end) as number_missing_values).
Does anyone have any ideas/direction I could try, or know if I'm doing something wrong?
Thank you!
Not really the answer you are looking for, but:
sum(case when column_value = '' then 1 else 0 end) as number_missing_values
When you say "missing value", this will only find values that are an empty string.
This can also be written in a simpler form as:
count_if(column_value = '') as number_missing_values
The database already knows how many rows are in a table, and it knows how many NULL values each column has. If you are loading data into a table, it might make more sense not to load empty strings and to use NULL instead. Then, for no compute cost, you can run:
count(*) - count(column) as number_empty_values
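Putting those together, a single pass over the table can report all of these at once; this is just a sketch with placeholder table/column names:
select
    count(*)                       as total_rows,
    count(*) - count(column_value) as null_values,
    count_if(column_value = '')    as empty_strings
from my_table;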
Also of note, if you have two tables in Snowflake you can compare them via MINUS, aka:
select * from table_1
minus
select * from table_2
This is useful for finding missing rows, but you do have to do it in both directions.
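So the second direction is just the same query flipped (sketch):
-- rows present in table_2 but missing from table_1
select * from table_2
minus
select * from table_1;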
Then you can HASH rows, or hash the whole table via HASH_AGG.
But normally when looking for missing data you have an external system, so the driver is "what can that system handle" and finding common ground.
Also, in the past we were searching for bugs in our processing that caused duplicate data (where we needed/wanted no duplicates), so the above, along with COUNT DISTINCT-like commands, came in useful.
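A sketch of that duplicate check, using hypothetical key columns:
-- keys that appear more than once
select key_col1, key_col2, count(*) as copies
from my_table
group by key_col1, key_col2
having count(*) > 1;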

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers, and the conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                         COUNT(*)
                                         from table
                                         GROUP BY org_id
                                         ORDER BY org_id")
DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument, which is a MySQL query, and various other arguments which aren't important right now.
Sample data looks like this:
org_id  inserted_at  lead_converted_at
1       10/17/2021   2021-01-27T03:39:03
2       10/18/2021   2021-01-28T03:39:03
1       10/17/2021   2021-01-28T03:39:03
3       10/19/2021   2021-01-29T03:39:03
2       10/18/2021   2021-01-29T03:39:03
I have looked through many aggregation tutorials online, but none of them seem to cover how to get data that is only available pre-aggregation (such as the number of leads per month per org). Once the aggregation has occurred this isn't possible; in the sample above, for example, aggregating removes the ability to see more than one instance of org_id 1. Yet the dataset needs to be aggregated in order to be accessed in the first place. Maybe I just don't understand this enough to know the right questions to ask. Any direction appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (e.g. one year at a time) so that it fits in memory; a package like chunked could help with that.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id and month.
To get the conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
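Putting it together, a rough sketch of the whole query, reusing the table and column names from your snippet and sample data (and assuming inserted_at is stored as a date, so adjust the truncation if not):
select org_id,
       date_format(inserted_at, '%Y-%m') as lead_month,
       count(*) as leads,
       sum(case when lead_converted_at is not null then 1 else 0 end) as converted_leads,
       sum(case when lead_converted_at is not null then 1 else 0 end) / count(*) as conversion_rate
from table
group by org_id, date_format(inserted_at, '%Y-%m')
order by org_id, lead_month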

Better "delete rows from table" performance

I have an RDF graph in Oracle that has approx. 7,000,000 triples (rows).
I have a simple SELECT statement that gets old duplicates (triples) and deletes them from this RDF graph.
Now,
let's say my SELECT returns 300 results.
This gets computationally very expensive, since the DELETE does a full scan of the TEST_tpl table 300 times, and as I said, TEST_tpl has approx. 7,000,000 rows...
DELETE FROM TEST_tpl t
WHERE t.triple.get_subject() IN
(
  SELECT rdf$stc_sub FROM rdf_stage_table_TEST
  WHERE rdf$stc_pred LIKE '%DateTime%'
)
I am trying to find a way to create an Oracle procedure that would go through the table only once for multiple values...
Or maybe someone knows of a better way...
The way I solved this is that I created an index on triple.get_subject():
CREATE INDEX "SEMANTIC"."TEST_tpl_SUB_IDX"
ON "SEMANTIC"."TEST_tpl" ("MDSYS"."SDO_RDF_TRIPLE_S"."GET_SUBJECT"("TRIPLE"))
This improved the performance tremendously.
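If you want to double-check that the DELETE actually uses the new index, one way (just a sketch) is to look at its execution plan:
EXPLAIN PLAN FOR
DELETE FROM TEST_tpl t
WHERE t.triple.get_subject() IN
  (SELECT rdf$stc_sub FROM rdf_stage_table_TEST
   WHERE rdf$stc_pred LIKE '%DateTime%');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);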
Thank you @Justin Cave and @Michael for your help.

Sample A CSV File Too Large To Load Into R?

I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, then sooner or later you need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data is in a consistent format and each row is one record, you can do:
sort -R data | head -n 1000 >data.sample
This will randomly sort all the rows and put the first 1000 rows into a separate file, data.sample.
(2) If the data is too large to fit into memory:
There is also the option of using a database to store the data. For example, I have many tables stored in a MySQL database in a beautiful tabular format. I can take a sample by doing:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your column to guarantee the query speed. You can also verify the mean or standard deviation of the whole dataset versus your sample if you want, taking advantage of the power of the database.
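For example, a quick sanity check of the full table against a random sample (the numeric column here is just a placeholder):
-- full-table statistics
select avg(some_numeric_column), stddev(some_numeric_column)
from tablename;

-- the same statistics on a random 1000-row sample
select avg(some_numeric_column), stddev(some_numeric_column)
from (select * from tablename order by rand() limit 1000) s;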
Based on my experience, these are the two most commonly used ways of dealing with 'big' data.

sqlite: query to add (subtract) cells from adjacent rows and put result in new column

I am examining a .sqlite file in FireFox's SQLite Manager and need to see if any data was not collected. An example is worth a thousand words:
ReadDate ReadValue
1361900350183.00 137
1361899753183.00 139
1361900053183.00 138
There are no primary keys and the table is NOT sorted by ReadDate or time. [Changing the input table is not an option!]
What I'd like to do is produce with simple SQL a table that looks like this:
ReadDate ReadValue TimeOffset
1361899753183.00 139
1361900053183.00 138 300000 // this is ReadDate(1) - ReadDate(0)
1361900350183.00 137 297000 // this is ReadDate(2) - ReadDate(1)
This would allow me to inspect the data and see if any data values were not captured (TimeOffset would be much greater than 300000). I could also write an additional query to get a COUNT of all TimeOffsets beyond a threshold.
I'm having trouble getting going on what I imagine is a simple exercise. I know how to do joins and sorts (order by), but here I need to compare one row to another. Do I need a cursor? And how to get the extra column? I have a gut feeling that if I just knew the vocabulary a little better, I'd be able to come up with the search terms and find the answer quickly.
Many thanks,
Dave
First, add an (empty) column to your table:
ALTER TABLE MyTable ADD COLUMN TimeOffset NUMERIC;
Then, the TimeOffset for each record is the difference between the ReadDate column of this record and of the record with the next smaller ReadDate, i.e., the record with the largest ReadDate that is still smaller than this one's:
UPDATE MyTable
SET TimeOffset = ReadDate - (SELECT MAX(ReadDate)
FROM MyTable AS t2
WHERE t2.ReadDate < MyTable.ReadDate);
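The follow-up count of gaps beyond a threshold mentioned in the question is then a one-liner (pick whatever value above the expected 300000 ms spacing counts as a missed reading):
SELECT COUNT(*) FROM MyTable WHERE TimeOffset > 300000;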
