Anti group by / R apply in PySpark

I am an R programmer moving into the PySpark world. I have gotten a lot of the basic tricks down, but something I still struggle with is the kind of thing I would handle with apply or a basic for loop.
In this case I am trying to calculate the "anti-groupby" for an ID. Basically, the idea is to look at the population for that ID and the population for everything that is not that ID, and have both of those values on the same row. Getting the population for that ID is easy using a groupBy and then joining it to a dataset with new_id as the only column.
This is how I would do it in R:
anti_group <- function(id){
  tr <- sum(subset(df1, new_id != id)$total_1)
  to <- sum(subset(df1, new_id != id)$total_2)
  54 * tr / to
}
test$other.RP54 <- sapply(test$new_id, anti_group)
How would I do it in PySpark?
Thanks!
Edit:
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
Then some function that creates a final dataframe that looks like this:
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
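(As a side note, a fairly literal PySpark translation of the R anti_group() function above might look like the untested sketch below. It assumes a DataFrame df1 with columns new_id, total_1 and total_2, and sums the totals of every other new_id through a non-equi self-join; the names anti and other_RP54 are just placeholders.)
import pyspark.sql.functions as F
ids = df1.select("new_id").distinct()
anti = (
    ids.alias("i")
    # keep every row of df1 whose new_id differs from the current id
    .join(df1.alias("d"), F.col("d.new_id") != F.col("i.new_id"))
    .groupBy(F.col("i.new_id"))
    .agg(F.sum(F.col("d.total_1")).alias("tr"),
         F.sum(F.col("d.total_2")).alias("to"))
    .withColumn("other_RP54", 54 * F.col("tr") / F.col("to"))
)
The window-based answers below reach the same "anti" totals without the self-join.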

I think you can do this in two steps: first you sum by id, then you take the overall total and subtract the value for that id.
My idea is a little bit like group_by(id) %>% summarise(x = sum(x)) %>% mutate(y = sum(x) - x) in dplyr.
The solution I propose is based on a window function. It is untested.
Let's first create the data:
import pyspark.sql.functions as psf
import pyspark.sql.window as psw
df = spark.createDataFrame([(1,40),(1,30),(2,10),(2,90),(3,20),(3,10),(4,2),(4,5)], ['id','value'])
df.show(2)
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
+---+-----+
only showing top 2 rows
and then apply that approach:
w = psw.Window.orderBy()
df_id = df.groupBy("id").agg(psf.sum("value").alias("grouped_total"))
df_id = (df_id
    .withColumn("anti_grouped_total", psf.sum("grouped_total").over(w))
    .withColumn("anti_grouped_total", psf.col("anti_grouped_total") - psf.col("grouped_total"))
)
df_id.show(2)
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 3| 30| 177|
| 1| 70| 137|
+---+-------------+------------------+
only showing top 2 rows

So there's no built-in function that replicates that anti-groupBy directly, but you could easily do it by creating a new column with a case expression (when/otherwise) to mark your group and anti-group, and then grouping by that new column.
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
from pyspark.sql import functions as F
df.withColumn("anti_id_1", F.when(F.col("id")==1, F.lit('1')).otherwise(F.lit('Not_1')))\
.groupBy("anti_id_1").agg(F.sum("value").alias("sum")).show()
+---------+---+
|anti_id_1|sum|
+---------+---+
| 1| 70|
| Not_1|137|
+---------+---+
UPDATE:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w1=Window().partitionBy("id")
w=Window().partitionBy()
df.withColumn("grouped_total", F.sum("value").over(w1))\
  .withColumn("anti_grouped_total", F.sum("value").over(w) - F.col("grouped_total"))\
  .groupBy("id").agg(F.first("grouped_total").alias("grouped_total"),\
                     F.first("anti_grouped_total").alias("anti_grouped_total"))\
  .drop("value").orderBy("id").show()
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
A less verbose/more concise way to achieve the same output:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value").alias("grouped_total"))\
  .withColumn("anti_grouped_total", F.sum("grouped_total").over(w) - F.col("grouped_total")).orderBy("id").show()
For 2 value columns:
df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
| 1| 40| 50|
| 1| 30| 70|
| 2| 10| 91|
| 2| 90| 21|
| 3| 20| 42|
| 3| 10| 4|
| 4| 2| 23|
| 4| 5| 12|
+---+------+------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value1").alias("grouped_total_1"),F.sum("value2").alias("grouped_total_2"))\
.withColumn("anti_grouped_total_1",F.sum("grouped_total_1").over(w)-F.col("grouped_total_1"))\
.withColumn("anti_grouped_total_2",F.sum("grouped_total_2").over(w)-F.col("grouped_total_2")).orderBy("id").show()
+---+---------------+---------------+--------------------+--------------------+
| id|grouped_total_1|grouped_total_2|anti_grouped_total_1|anti_grouped_total_2|
+---+---------------+---------------+--------------------+--------------------+
| 1| 70| 120| 137| 193|
| 2| 100| 112| 107| 201|
| 3| 30| 46| 177| 267|
| 4| 7| 35| 200| 278|
+---+---------------+---------------+--------------------+--------------------+

Related

How to not automatically sort my y axis bar plot in ggplot & lemon

I am trying to display data by Species, where the values differ depending on the group Letter. The best way I have found to display my data is to put the categorical data on the y-axis and Total_Observed on the x-axis. Lemon allows me to have different y-axis labels. Unfortunately, the graph sorts by my y-axis labels instead of using my data as-is, which is sorted from most abundant species to least abundant. Any suggestions?
Using libraries: dplyr, ggplot2, lemon
My data:
|Letter |Species | Total_Observed|
|:------|:------------------------|--------------:|
|A |Yellowtail snapper | 155|
|A |Sharksucker | 119|
|A |Tomtate | 116|
|A |Mutton snapper | 104|
|A |Little tunny | 96|
|B |Vermilion snapper | 1655|
|B |Red snapper | 1168|
|B |Gray triggerfish | 689|
|B |Tomtate | 477|
|B |Red porgy | 253|
|C |Red snapper | 391|
|C |Vermilion snapper | 114|
|C |Lane snapper | 95|
|C |Atlantic sharpnose shark | 86|
|C |Tomtate | 73|
|D |Lane snapper | 627|
|D |Red grouper | 476|
|D |White grunt | 335|
|D |Gray snapper | 102|
|D |Sand perch | 50|
|E |White grunt | 515|
|E |Red grouper | 426|
|E |Red snapper | 150|
|E |Black sea bass | 142|
|E |Lane snapper | 88|
|E |Gag | 88|
|F |Yellowtail snapper | 385|
|F |White grunt | 105|
|F |Gray snapper | 88|
|F |Mutton snapper | 82|
|F |Lane snapper | 59|
Then I run the code for my ggplot/lemon plot:
ggplot(test,aes(y=Species,x=Total_Observed))+geom_histogram(stat='identity')+facet_wrap(.~test$Letter,scales='free_y')
And my graphs print like this:
Answered via Johan Rosa's shared blog post (https://juliasilge.com/blog/reorder-within/): the solution is to use library(tidytext) with the functions reorder_within() and scale_x_reordered().
The corrected code:
test %>% mutate(Species=reorder_within(Species,Total_Observed,Letter)) %>% ggplot(aes(Species,Total_Observed))+geom_histogram(stat='identity')+facet_wrap(~Letter,scales='free_y')+coord_flip()+scale_x_reordered()
This now generates the graphs ordered correctly.

SQLite: sum of a group by time, limited to top N

I have tried a search and seen that there is apparently a group-by-n-max tag, but the answers don't seem applicable to the problem I have.
I have a large set of data recording scores and attempts (and of course a load of other crud) against a timestamp; the timestamp is almost like a bucket in itself.
What I currently have is the relatively simple
select sum(score),sum(attempts),time from records group by time order by time asc;
This works well, except when the number of people changes per timestamp. So I need to limit the number summed within the group-by to a consistent count, say 40, and to make matters worse, it is an ordered list I would like to limit by (although if the limit were achievable, the ordering should be handled in much the same way).
The timestamps are obtainable by doing a select against the table, so I guess it would be possible to do a join with a limit. However, it feels like there should be an easier method. Unfortunately it is not an average that I want, otherwise I could of course just add a count to the group.
Edit: Yes, I should have included example input and output.
The input table is on the left; note that for times 4 and 8 there are 4 people (a, b, c, d), though only a and d appear at every time. So take limiting to 3 people as an example. On the right is the calculation of their rank within each time: for times 4 and 8, people c and d each fall outside the top 3 of the score rank at one of those times.
picture of input data and example rank calculation
So the basic sum() with group by time gives too large a result for the times where there are 4 people, i.e. times 4 and 8 (one way to apply the top-N limit is sketched after the tables below).
image showing calculation of group-by, and desired output
Input:
|score|attempts|time|user|
|----:|-------:|---:|:---|
|10| 4| 4| a|
|9| 6| 5| a|
|12| 7| 6| a|
|4| 8| 7| a|
|6| 9| 8| a|
|13| 1| 4| b|
|5| 3| 6| b|
|6| 5| 7| b|
|7| 7| 8| b|
|24| 2| 4| c|
|2| 5| 5| c|
|1| 7| 7| c|
|5| 6| 8| c|
|5| 3| 4| d|
|3| 4| 5| d|
|5| 6| 6| d|
|7| 2| 8| d|
Desired output (see images for a better idea)
|score|attempts|time|
|----:|-------:|---:|
|47| 7| 4|
|14| 15| 5|
|22| 16| 6|
|11| 20| 7|
|20| 18| 8|
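One way to get that output is to rank the rows within each time with a window function and sum only the top N ranks. Below is a minimal sketch of that idea, run through Python's sqlite3 module with the sample data above; it assumes SQLite 3.25+ (window-function support) and a table named records, and the limit of 3 would become 40 for the real data. It is untested against the real schema.
import sqlite3
# (score, attempts, time, user) rows from the sample input above
rows = [
    (10, 4, 4, 'a'), (9, 6, 5, 'a'), (12, 7, 6, 'a'), (4, 8, 7, 'a'), (6, 9, 8, 'a'),
    (13, 1, 4, 'b'), (5, 3, 6, 'b'), (6, 5, 7, 'b'), (7, 7, 8, 'b'),
    (24, 2, 4, 'c'), (2, 5, 5, 'c'), (1, 7, 7, 'c'), (5, 6, 8, 'c'),
    (5, 3, 4, 'd'), (3, 4, 5, 'd'), (5, 6, 6, 'd'), (7, 2, 8, 'd'),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (score INT, attempts INT, time INT, user TEXT)")
con.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)
query = """
WITH ranked AS (
    SELECT score, attempts, time,
           ROW_NUMBER() OVER (PARTITION BY time ORDER BY score DESC) AS rnk
    FROM records
)
SELECT SUM(score), SUM(attempts), time
FROM ranked
WHERE rnk <= 3        -- top 3 scorers per time; use 40 for the real data
GROUP BY time
ORDER BY time ASC
"""
for score, attempts, time in con.execute(query):
    print(score, attempts, time)
This prints 47 7 4, 14 15 5, 22 16 6, 11 20 7 and 20 18 8, matching the desired output above.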

How to put parenthesis on knitr_kable object

I have a knitr_kable object. I want to put parentheses around every value in every second (2, 4, 6, ...) row. How can I do this?
out %>% knitr::kable(digits = 2, "pipe")
| | 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 10-1|
|:-------------|-----:|-----:|-----:|----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
|Excess Return | -0.20| 1.20| 0.97| 1.04| 0.77| 0.82| 0.79| 0.76| 0.76| 0.59| 0.79|
|NW t_Stat | -0.28| 2.04| 3.02| 3.17| 2.56| 3.16| 3.09| 3.45| 3.86| 3.48| 1.13|
|FF5 Alpha | -0.97| 0.43| -0.05| 0.02| -0.26| -0.17| -0.18| -0.19| -0.11| -0.11| 0.86|
|NW t_stats | -1.41| 1.03| -0.26| 0.14| -2.75| -2.10| -2.32| -2.85| -1.77| -5.01| 1.26|
|MktRF beta | 0.92| 1.00| 1.07| 1.08| 1.13| 1.11| 1.10| 1.08| 1.06| 0.99| 0.07|
|NW t_Stat | 1.34| 2.40| 5.44| 6.49| 11.98| 13.51| 14.58| 16.13| 16.43| 46.90| 0.10|
|SMB5 beta | 1.35| 1.34| 1.46| 1.26| 1.14| 1.03| 0.84| 0.69| 0.36| -0.16| -1.51|
|NW t_Stat | 1.97| 3.22| 7.41| 7.56| 12.11| 12.44| 11.05| 10.28| 5.61| -7.37| -2.20|
|HML beta | 0.52| 1.27| 0.48| 0.46| 0.55| 0.51| 0.36| 0.29| 0.22| 0.10| -0.42|
|NW t_Stat | 0.75| 3.05| 2.46| 2.75| 5.82| 6.23| 4.70| 4.38| 3.47| 4.53| -0.62|
|CMA beta | 0.43| -0.66| 0.15| 0.28| 0.11| -0.02| 0.10| 0.11| 0.13| 0.08| -0.34|
|NW t_stat | 0.62| -1.59| 0.77| 1.65| 1.14| -0.28| 1.38| 1.61| 1.96| 3.92| -0.50|
|RMW beta | -0.68| -0.25| 0.11| 0.08| 0.15| 0.24| 0.32| 0.35| 0.31| 0.17| 0.84|
|NW t_stat | -0.98| -0.60| 0.54| 0.47| 1.56| 2.96| 4.22| 5.18| 4.75| 7.87| 1.23|
|Adj_r_square | 0.14| 0.31| 0.64| 0.73| 0.87| 0.93| 0.94| 0.94| 0.93| 0.98| 0.09|

How to implement PySpark StandardScaler on subset of columns?

I want to use pyspark StandardScaler on 6 out of 10 columns in my dataframe. This will be part of a pipeline.
The inputCol parameter seems to expect a vector, which I can pass in after using VectorAssembler on all my features, but this scales all 10 features. I don’t want to scale the other 4 features because they are binary and I want unstandardized coefficients for them.
Am I supposed to use vector assembler on the 6 features, scale them, then use vector assembler again on this scaled features vector and the remaining 4 features? I would end up with a vector within a vector and I’m not sure this will work.
What’s the right way to do this? An example is appreciated.
You can do this by using VectorAssembler. The key is that you have to extract the columns from the assembler output. See the code below for a working example:
from pyspark.ml.feature import MinMaxScaler, StandardScaler
from pyspark.ml.feature import VectorAssembler
import pandas as pd
import numpy as np
import random
df = pd.DataFrame()
df['a'] = random.sample(range(100), 10)
df['b'] = random.sample(range(100), 10)
df['c'] = random.sample(range(100), 10)
df['d'] = random.sample(range(100), 10)
df['e'] = random.sample(range(100), 10)
sdf = spark.createDataFrame(df)
sdf.show()
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 51| 13| 6| 5| 26|
| 18| 29| 19| 81| 28|
| 34| 1| 36| 57| 87|
| 56| 86| 51| 52| 48|
| 36| 49| 33| 15| 54|
| 87| 53| 47| 89| 85|
| 7| 14| 55| 13| 98|
| 70| 50| 32| 39| 58|
| 80| 20| 25| 54| 37|
| 40| 33| 44| 83| 27|
+---+---+---+---+---+
cols_to_scale = ['c', 'd', 'e']
cols_to_keep_unscaled = ['a', 'b']
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
assembler = VectorAssembler().setInputCols(cols_to_scale).setOutputCol("features")
sdf_transformed = assembler.transform(sdf)
scaler_model = scaler.fit(sdf_transformed.select("features"))
sdf_scaled = scaler_model.transform(sdf_transformed)
sdf_scaled.show()
+---+---+---+---+---+----------------+--------------------+
| a| b| c| d| e| features| scaledFeatures|
+---+---+---+---+---+----------------+--------------------+
| 51| 13| 6| 5| 26| [6.0,5.0,26.0]|[0.39358015146628...|
| 18| 29| 19| 81| 28|[19.0,81.0,28.0]|[1.24633714630991...|
| 34| 1| 36| 57| 87|[36.0,57.0,87.0]|[2.36148090879773...|
| 56| 86| 51| 52| 48|[51.0,52.0,48.0]|[3.34543128746345...|
| 36| 49| 33| 15| 54|[33.0,15.0,54.0]|[2.16469083306459...|
| 87| 53| 47| 89| 85|[47.0,89.0,85.0]|[3.08304451981926...|
| 7| 14| 55| 13| 98|[55.0,13.0,98.0]|[3.60781805510765...|
| 70| 50| 32| 39| 58|[32.0,39.0,58.0]|[2.09909414115354...|
| 80| 20| 25| 54| 37|[25.0,54.0,37.0]|[1.63991729777620...|
| 40| 33| 44| 83| 27|[44.0,83.0,27.0]|[2.88625444408612...|
+---+---+---+---+---+----------------+--------------------+
# Helper function to flatten the scaled vector back into separate columns
def extract(row):
    return (row.a, row.b,) + tuple(row.scaledFeatures.toArray().tolist())
sdf_scaled = sdf_scaled.select(*cols_to_keep_unscaled, "scaledFeatures").rdd \
.map(extract).toDF(cols_to_keep_unscaled + cols_to_scale)
sdf_scaled.show()
+---+---+------------------+-------------------+------------------+
| a| b| c| d| e|
+---+---+------------------+-------------------+------------------+
| 51| 13|0.3935801514662892|0.16399957083190683|0.9667572801316145|
| 18| 29| 1.246337146309916| 2.656793047476891|1.0411232247571234|
| 34| 1|2.3614809087977355| 1.8695951074837378|3.2349185912096337|
| 56| 86|3.3454312874634584| 1.7055955366518312|1.7847826710122114|
| 36| 49| 2.164690833064591|0.49199871249572047| 2.007880504888738|
| 87| 53| 3.083044519819266| 2.9191923608079415|3.1605526465841245|
| 7| 14|3.6078180551076513| 0.4263988841629578| 3.643931286649932|
| 70| 50|2.0990941411535426| 1.2791966524888734|2.1566123941397555|
| 80| 20| 1.639917297776205| 1.7711953649845937| 1.375769975571913|
| 40| 33|2.8862544440861213| 2.7223928758096534| 1.003940252444369|
+---+---+------------------+-------------------+------------------+
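Since the question mentions this should be part of a pipeline: VectorAssembler accepts vector columns alongside plain numeric columns and flattens everything into a single vector, so there is no vector-within-a-vector problem. A rough, untested sketch of that wiring (the stage and column names to_scale, scaled and features are just placeholders):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, VectorAssembler
cols_to_scale = ['c', 'd', 'e']
cols_to_keep_unscaled = ['a', 'b']
pipeline = Pipeline(stages=[
    # 1. pack only the columns to be scaled into a vector
    VectorAssembler(inputCols=cols_to_scale, outputCol="to_scale"),
    # 2. standardize that vector
    StandardScaler(inputCol="to_scale", outputCol="scaled"),
    # 3. combine the scaled vector with the untouched columns into one feature vector
    VectorAssembler(inputCols=["scaled"] + cols_to_keep_unscaled, outputCol="features"),
])
final_df = pipeline.fit(sdf).transform(sdf)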

What do the result columns of a Sqlite explain query mean?

When executing "explain select ..." in SQLite, the execution plan is returned as a result set. What do the columns mean?
The documentation simply says that the columns may change with each release. https://www.sqlite.org/eqp.html
Example:
addr| opcode| p1| p2| p3| p4| p5| comment
0| Init| 0| 15| 0| | 00| null
1| OpenRead| 0| 5| 0| 7| 00| null
2| Variable| 1| 1| 0| | 00| null
3| SeekRowid| 0| 14| 1| | 00| null
4| Copy| 1| 2| 0| | 00| null
This other part of the SQLite documentation suggests that they are operands of the SQLite bytecode engine. I had been hoping that they were estimates of execution time.
"Each instruction has an opcode and five operands named P1, P2 P3, P4, and P5."
https://www.sqlite.org/opcode.html
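If you just want to see what those columns are called in your own build, you can run EXPLAIN through Python's sqlite3 module and inspect cursor.description. A small sketch (the comment column is only populated when SQLite is compiled with SQLITE_ENABLE_EXPLAIN_COMMENTS):
import sqlite3
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INT, b TEXT)")
# EXPLAIN (without QUERY PLAN) returns one row per bytecode instruction
cur = con.execute("EXPLAIN SELECT * FROM t WHERE a = 1")
print([d[0] for d in cur.description])  # column names: addr, opcode, p1..p5, comment
for row in cur:
    print(row)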
