What do the result columns of a SQLite EXPLAIN query mean?

When executing "EXPLAIN SELECT ..." on SQLite, the execution plan is returned as a result set. What do the columns mean?
The documentation simply says that the columns may change with each release. https://www.sqlite.org/eqp.html
Example:
addr| opcode| p1| p2| p3| p4| p5| comment
0| Init| 0| 15| 0| | 00| null
1| OpenRead| 0| 5| 0| 7| 00| null
2| Variable| 1| 1| 0| | 00| null
3| SeekRowid| 0| 14| 1| | 00| null
4| Copy| 1| 2| 0| | 00| null

This other part of the SQLite documentation suggests that they are operands of the SQLite bytecode engine. I had been hoping they were estimates of execution time.
"Each instruction has an opcode and five operands named P1, P2, P3, P4, and P5."
https://www.sqlite.org/opcode.html
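For what it's worth, you can poke at both outputs from Python's built-in sqlite3 module. This is a minimal sketch (the table t and the query are made up for illustration): plain EXPLAIN returns one row per bytecode instruction with exactly the eight columns shown in the question, while EXPLAIN QUERY PLAN gives the higher-level view described on the eqp.html page.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")

# Plain EXPLAIN: one row per VDBE bytecode instruction, with the operand
# columns named after the opcode documentation (p1..p5).
cur = con.execute("EXPLAIN SELECT name FROM t WHERE id = ?", (1,))
cols = [d[0] for d in cur.description]
print(cols)  # ['addr', 'opcode', 'p1', 'p2', 'p3', 'p4', 'p5', 'comment']

# EXPLAIN QUERY PLAN: the high-level access strategy; the exact wording
# of each row varies between SQLite versions.
for row in con.execute("EXPLAIN QUERY PLAN SELECT name FROM t WHERE id = ?", (1,)):
    print(row)
```

Neither output contains timing estimates; EXPLAIN QUERY PLAN is the closest thing to a human-readable plan.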

Related

SQLITE Sum of a group by time limited to top N

I have tried a search and seen that there is apparently a group-by-n-max tag, but the answers don't seem applicable to my problem.
I have a large set of data recording scores and attempts (and of course a load of other crud) against a timestamp; the timestamp is almost like a bucket in itself.
What I currently have, is a relatively simplistic
select sum(score),sum(attempts),time from records group by time order by time asc;
This works well, except when the number of people changes per timestamp. So I need to limit the number summed within the group-by to a consistent count, say 40. To make matters worse, it is an ordered list I would like to limit by (although if the limit were achievable, the ordering should follow relatively easily).
The timestamps are obtainable by doing a select against the table, so I guess it would be possible to do a join with a limit. However, it feels like there should be an easier method. Unfortunately it is not an average that I want, otherwise I could of course just add a count to the group.
Edit: Yes, I should have included example input and output.
Input table on the left; note that for times 4 and 8 there are four people (a, b, c, d), but only a and d appear at every time. As an example, limit to the top 3 people. On the right, the calculation of each person's rank within each time: at time 4, d falls outside the top 3 by score, and at time 8, c does.
(picture of input data and example rank calculation)
So the basic sum() group by time gives too large a result for the times where there are 4 people, i.e. times 4 and 8.
(image showing calculation of group-by, and desired output)
Input:
| score | attempts | time | user |
|------:|---------:|-----:|:-----|
|    10 |        4 |    4 | a    |
|     9 |        6 |    5 | a    |
|    12 |        7 |    6 | a    |
|     4 |        8 |    7 | a    |
|     6 |        9 |    8 | a    |
|    13 |        1 |    4 | b    |
|     5 |        3 |    6 | b    |
|     6 |        5 |    7 | b    |
|     7 |        7 |    8 | b    |
|    24 |        2 |    4 | c    |
|     2 |        5 |    5 | c    |
|     1 |        7 |    7 | c    |
|     5 |        6 |    8 | c    |
|     5 |        3 |    4 | d    |
|     3 |        4 |    5 | d    |
|     5 |        6 |    6 | d    |
|     7 |        2 |    8 | d    |
Desired output (see images for a better idea):

| score | attempts | time |
|------:|---------:|-----:|
|    47 |        7 |    4 |
|    14 |       15 |    5 |
|    22 |       16 |    6 |
|    11 |       20 |    7 |
|    20 |       18 |    8 |
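Since SQLite 3.25, this kind of top-N-per-group limit can be written directly with a window function: rank the rows within each time by score, keep the top N, then do the original group-by sum. A sketch against the sample data above (run through Python's sqlite3 for convenience; the records table and column names are taken from the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (score INT, attempts INT, time INT, user TEXT)")
rows = [
    (10, 4, 4, 'a'), (9, 6, 5, 'a'), (12, 7, 6, 'a'), (4, 8, 7, 'a'), (6, 9, 8, 'a'),
    (13, 1, 4, 'b'), (5, 3, 6, 'b'), (6, 5, 7, 'b'), (7, 7, 8, 'b'),
    (24, 2, 4, 'c'), (2, 5, 5, 'c'), (1, 7, 7, 'c'), (5, 6, 8, 'c'),
    (5, 3, 4, 'd'), (3, 4, 5, 'd'), (5, 6, 6, 'd'), (7, 2, 8, 'd'),
]
con.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)

# Rank every row within its time bucket by score, keep the top 3, then sum.
result = con.execute("""
    SELECT sum(score), sum(attempts), time
    FROM (SELECT score, attempts, time,
                 ROW_NUMBER() OVER (PARTITION BY time ORDER BY score DESC) AS rnk
          FROM records)
    WHERE rnk <= 3
    GROUP BY time
    ORDER BY time ASC
""").fetchall()
for r in result:
    print(r)
```

This reproduces the desired output above. Note that ROW_NUMBER() breaks score ties arbitrarily; use RANK() instead if tied rows should all be kept.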

How to implement PySpark StandardScaler on subset of columns?

I want to use pyspark StandardScaler on 6 out of 10 columns in my dataframe. This will be part of a pipeline.
The inputCol parameter seems to expect a vector, which I can pass in after using VectorAssembler on all my features, but this scales all 10 features. I don’t want to scale the other 4 features because they are binary and I want unstandardized coefficients for them.
Am I supposed to use vector assembler on the 6 features, scale them, then use vector assembler again on this scaled features vector and the remaining 4 features? I would end up with a vector within a vector and I’m not sure this will work.
What’s the right way to do this? An example is appreciated.
You can do this by using VectorAssembler. The key is that you have to extract the columns back out of the assembler output afterwards. See the code below for a working example:
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml.feature import VectorAssembler
import pandas as pd
import numpy as np
import random
df = pd.DataFrame()
df['a'] = random.sample(range(100), 10)
df['b'] = random.sample(range(100), 10)
df['c'] = random.sample(range(100), 10)
df['d'] = random.sample(range(100), 10)
df['e'] = random.sample(range(100), 10)
sdf = spark.createDataFrame(df)  # the SparkSession, not the SparkContext, has createDataFrame
sdf.show()
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 51| 13| 6| 5| 26|
| 18| 29| 19| 81| 28|
| 34| 1| 36| 57| 87|
| 56| 86| 51| 52| 48|
| 36| 49| 33| 15| 54|
| 87| 53| 47| 89| 85|
| 7| 14| 55| 13| 98|
| 70| 50| 32| 39| 58|
| 80| 20| 25| 54| 37|
| 40| 33| 44| 83| 27|
+---+---+---+---+---+
cols_to_scale = ['c', 'd', 'e']
cols_to_keep_unscaled = ['a', 'b']
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
assembler = VectorAssembler().setInputCols(cols_to_scale).setOutputCol("features")
sdf_transformed = assembler.transform(sdf)
scaler_model = scaler.fit(sdf_transformed.select("features"))
sdf_scaled = scaler_model.transform(sdf_transformed)
sdf_scaled.show()
+---+---+---+---+---+----------------+--------------------+
| a| b| c| d| e| features| scaledFeatures|
+---+---+---+---+---+----------------+--------------------+
| 51| 13| 6| 5| 26| [6.0,5.0,26.0]|[0.39358015146628...|
| 18| 29| 19| 81| 28|[19.0,81.0,28.0]|[1.24633714630991...|
| 34| 1| 36| 57| 87|[36.0,57.0,87.0]|[2.36148090879773...|
| 56| 86| 51| 52| 48|[51.0,52.0,48.0]|[3.34543128746345...|
| 36| 49| 33| 15| 54|[33.0,15.0,54.0]|[2.16469083306459...|
| 87| 53| 47| 89| 85|[47.0,89.0,85.0]|[3.08304451981926...|
| 7| 14| 55| 13| 98|[55.0,13.0,98.0]|[3.60781805510765...|
| 70| 50| 32| 39| 58|[32.0,39.0,58.0]|[2.09909414115354...|
| 80| 20| 25| 54| 37|[25.0,54.0,37.0]|[1.63991729777620...|
| 40| 33| 44| 83| 27|[44.0,83.0,27.0]|[2.88625444408612...|
+---+---+---+---+---+----------------+--------------------+
# Helper to flatten the scaled vector back out into named columns
def extract(row):
    return (row.a, row.b,) + tuple(row.scaledFeatures.toArray().tolist())

sdf_scaled = sdf_scaled.select(*cols_to_keep_unscaled, "scaledFeatures").rdd \
    .map(extract).toDF(cols_to_keep_unscaled + cols_to_scale)
sdf_scaled.show()
+---+---+------------------+-------------------+------------------+
|  a|  b|                 c|                  d|                 e|
+---+---+------------------+-------------------+------------------+
| 51| 13|0.3935801514662892|0.16399957083190683|0.9667572801316145|
| 18| 29| 1.246337146309916| 2.656793047476891|1.0411232247571234|
| 34| 1|2.3614809087977355| 1.8695951074837378|3.2349185912096337|
| 56| 86|3.3454312874634584| 1.7055955366518312|1.7847826710122114|
| 36| 49| 2.164690833064591|0.49199871249572047| 2.007880504888738|
| 87| 53| 3.083044519819266| 2.9191923608079415|3.1605526465841245|
| 7| 14|3.6078180551076513| 0.4263988841629578| 3.643931286649932|
| 70| 50|2.0990941411535426| 1.2791966524888734|2.1566123941397555|
| 80| 20| 1.639917297776205| 1.7711953649845937| 1.375769975571913|
| 40| 33|2.8862544440861213| 2.7223928758096534| 1.003940252444369|
+---+---+------------------+-------------------+------------------+
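As a sanity check on what the scaler is doing: with its default settings (withStd=True, withMean=False), Spark's StandardScaler simply divides each column by its sample standard deviation, without centering. You can reproduce the first scaled value of column c from the output above in plain Python:

```python
import statistics

# Column c from the example dataframe above
c = [6, 19, 36, 51, 33, 47, 55, 32, 25, 44]

# Divide by the sample standard deviation (no mean-centering by default)
scaled = [x / statistics.stdev(c) for x in c]
print(scaled[0])  # ~0.39358, matching the first entry of scaledFeatures above
```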

Anti group by/R apply in Pyspark

I am an R programmer moving into the PySpark world. I have a lot of the basic tricks down, but I am still struggling with things I would use apply or basic for loops for.
In this case I am trying to calculate the "anti-groupby" for an ID. Basically, the idea is to look at the population for that ID, and the population for everything that is not that ID, and have both values on the same row. Getting the population for the ID itself is easy using a groupby, then joining it to a dataset with new_id as the only column.
This is how I would do it in R:
anti_group <- function(id){
    tr <- sum(subset(df1, new_id != id)$total_1)
    to <- sum(subset(df1, new_id != id)$total_2)
    54 * tr / to
}
test$other.RP54 <- sapply(test$new_id, anti_group)
How would I do it in pyspark?
Thanks!
Edit:
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
Then some function that creates a final dataframe that looks like this:
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
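Whatever API you use, the arithmetic being asked for is just: anti_grouped_total = (grand total over all ids) minus (that id's own grouped_total). A plain-Python check against the sample data above:

```python
from collections import defaultdict

data = [(1, 40), (1, 30), (2, 10), (2, 90), (3, 20), (3, 10), (4, 2), (4, 5)]

grouped = defaultdict(int)
for i, v in data:
    grouped[i] += v                      # grouped_total per id

total = sum(v for _, v in data)          # grand total over all ids
anti = {i: total - g for i, g in grouped.items()}   # anti_grouped_total
print(anti)  # {1: 137, 2: 107, 3: 177, 4: 200}
```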
I think you can do that in two steps: first you sum by id, then you take the overall total and subtract the value for this id.
My idea is a little bit like a group_by(id) %>% summarise(x = sum(x)) %>% mutate(y = sum(x) - x) in dplyr
The solution I propose is based on Window function. It is untested:
Let's first create the data
import pyspark.sql.functions as psf
import pyspark.sql.window as psw
df = spark.createDataFrame([(1,40),(1,30),(2,10),(2,90),(3,20),(3,10),(4,2),(4,5)], ['id','value'])
df.show(2)
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
+---+-----+
only showing top 2 rows
and then apply that approach:
w = psw.Window.orderBy()
df_id = df.groupBy("id").agg(psf.sum("value").alias("grouped_total"))
df_id = (df_id
    .withColumn("anti_grouped_total", psf.sum("grouped_total").over(w))
    .withColumn("anti_grouped_total", psf.col("anti_grouped_total") - psf.col("grouped_total"))
)
df_id.show(2)
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 3| 30| 177|
| 1| 70| 137|
+---+-------------+------------------+
only showing top 2 rows
So there's no built-in function that replicates that groupBy behaviour, but you can easily do it by creating a new column with a case (when/otherwise) clause to mark the group and anti-group, and then grouping by that new column.
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
from pyspark.sql import functions as F
df.withColumn("anti_id_1", F.when(F.col("id")==1, F.lit('1')).otherwise(F.lit('Not_1')))\
.groupBy("anti_id_1").agg(F.sum("value").alias("sum")).show()
+---------+---+
|anti_id_1|sum|
+---------+---+
| 1| 70|
| Not_1|137|
+---------+---+
UPDATE:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w1=Window().partitionBy("id")
w=Window().partitionBy()
df.withColumn("grouped_total",F.sum("value").over(w1))\
.withColumn("anti_grouped_total", (F.sum("value").over(w))-F.col("grouped_total"))\
.groupBy("id").agg(F.first("grouped_total").alias("grouped_total"),\
F.first("anti_grouped_total").alias("anti_grouped_total"))\
.drop("value").orderBy("id").show()
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
Less verbose/concise way to achieve the same output:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value").alias("grouped_total"))\
.withColumn("anti_grouped_total",F.sum("grouped_total").over(w)-F.col("grouped_total")).orderBy("id").show()
For 2 value columns:
df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
| 1| 40| 50|
| 1| 30| 70|
| 2| 10| 91|
| 2| 90| 21|
| 3| 20| 42|
| 3| 10| 4|
| 4| 2| 23|
| 4| 5| 12|
+---+------+------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value1").alias("grouped_total_1"),F.sum("value2").alias("grouped_total_2"))\
.withColumn("anti_grouped_total_1",F.sum("grouped_total_1").over(w)-F.col("grouped_total_1"))\
.withColumn("anti_grouped_total_2",F.sum("grouped_total_2").over(w)-F.col("grouped_total_2")).orderBy("id").show()
+---+---------------+---------------+--------------------+--------------------+
| id|grouped_total_1|grouped_total_2|anti_grouped_total_1|anti_grouped_total_2|
+---+---------------+---------------+--------------------+--------------------+
| 1| 70| 120| 137| 193|
| 2| 100| 112| 107| 201|
| 3| 30| 46| 177| 267|
| 4| 7| 35| 200| 278|
+---+---------------+---------------+--------------------+--------------------+

Count rows in a dataframe object with criteria in R

Okay, I have a bit of a noob question, so please excuse me. I have a data frame object as follows:
| order_id| department_id|department | n|
|--------:|-------------:|:-------------|--:|
| 1| 4|produce | 4|
| 1| 15|canned goods | 1|
| 1| 16|dairy eggs | 3|
| 36| 4|produce | 3|
| 36| 7|beverages | 1|
| 36| 16|dairy eggs | 3|
| 36| 20|deli | 1|
| 38| 1|frozen | 1|
| 38| 4|produce | 6|
| 38| 13|pantry | 1|
| 38| 19|snacks | 1|
| 96| 1|frozen | 2|
| 96| 4|produce | 4|
| 96| 20|deli | 1|
This is the code I've used to arrive at this object:
temp5 <- opt %>%
    left_join(products, by = "product_id") %>%
    left_join(departments, by = "department_id") %>%
    group_by(order_id, department_id, department) %>%
    tally() %>%
    group_by(department_id)
kable(head(temp5, 14))
As you can see, the object contains the departments present in each order_id. Now, what I want to do is count the number of departments for each order_id.
I tried using the summarise() method from the dplyr package, but it throws the following error:
Error in summarise_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "factor".
It seems so simple, but I can't figure out how to do it. Any help will be appreciated.
Edit: this is the code that I tried to run. Afterwards I read about the count() function in the plyr package and tried that as well, but it is of no use here, since it needs a data frame as input, whereas I only want to count the number of occurrences within the data frame.
temp5 <- opt %>%
    left_join(products, by = "product_id") %>%
    left_join(departments, by = "department_id") %>%
    group_by(order_id, department_id, department) %>%
    tally() %>%
    group_by(department_id) %>%
    summarise(count(department))
In the output, I need the number of departments ordered from in each order_id (so that I can then find the average), i.e. something like this:

| order_id | no. of departments |
|---------:|-------------------:|
|        1 |                  3 |
|       36 |                  4 |
|       38 |                  4 |
|       96 |                  3 |
And then I should be able to plot using ggplot, no. of orders vs no. of departments in each order. Hope this is clear
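The counting step itself is just "number of distinct departments per order_id". In plain Python terms (data taken from the table above), that is:

```python
data = [  # (order_id, department_id) pairs from the table above
    (1, 4), (1, 15), (1, 16),
    (36, 4), (36, 7), (36, 16), (36, 20),
    (38, 1), (38, 4), (38, 13), (38, 19),
    (96, 1), (96, 4), (96, 20),
]

# Collect the distinct departments seen for each order, then count them
depts_per_order = {}
for order_id, dept_id in data:
    depts_per_order.setdefault(order_id, set()).add(dept_id)
n_departments = {o: len(d) for o, d in depts_per_order.items()}
print(n_departments)  # {1: 3, 36: 4, 38: 4, 96: 3}
```

In dplyr, summarise(n = n_distinct(department)) after group_by(order_id) expresses the same thing.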

converting data frame to matrix, sapply is returning a list

I want to create a matrix where the result shows the difference between columns. I have used sapply, which visually returns the correct results, but as a list. I have a one-row data frame that is all numeric.
Example data:

|  A| B| C|
|--:|-:|-:|
| 12| 6| 7|

I want to return:

|  |  A| B| C|
|--|--:|-:|-:|
| A|  0| 6| 5|
| B|  6| 0| 1|
| C|  5| 1| 0|
I have tried
mymatrix <- sapply(DF, function(x) abs(x - DF))
This returns a list that looks like a matrix and gives the correct information, but I need it as a matrix, not a list.
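For comparison, here is the same pairwise absolute-difference matrix computed in plain Python (values taken from the example above); the point is that the result should be a true two-dimensional structure, not a list of columns:

```python
vals = {"A": 12, "B": 6, "C": 7}
names = list(vals)

# Pairwise absolute differences, rows and columns both in A, B, C order
mat = [[abs(vals[r] - vals[c]) for c in names] for r in names]
print(mat)  # [[0, 6, 5], [6, 0, 1], [5, 1, 0]]
```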
