I have tried a search and seen that there is apparently a group-by-n-max tag, but the answers don't seem applicable to the problem I have.
I have a large set of data recording scores and attempts (and of course a load of other crud) against a timestamp; the timestamp is almost like a bucket in itself.
What I currently have is a relatively simple
select sum(score),sum(attempts),time from records group by time order by time asc;
This works well, except when the number of people changes per timestamp. So I need to limit the rows summed within each group to a consistent number, say 40, and to make matters worse, the rows I want to keep are the top entries of a list ordered by score, not just any 40.
The set of timestamps can be obtained with a select against the table, so I guess it would be possible to do a join with a limit, but it feels like there should be an easier method. Unfortunately it is not an average that I want, otherwise I could of course just add a count to the group by.
Edit: Yes, I should have included example input and output.
Input table on the left. Note that times 4 and 8 have 4 people (a, b, c, d), while the other times have only 3. So, as an example, limit to 3 people per time. On the right is the calculation of each person's rank within each time; for times 4 and 8, one person (d at time 4, c at time 8) falls outside the top 3 by score.
[picture of input data and example rank calculation]
So the basic sum() group by time gives too large a result for the times where there are 4 people, i.e. times 4 and 8.
[image showing calculation of group-by, and desired output]
Input:
| score | attempts | time | user |
|-------|----------|------|------|
| 10 | 4 | 4 | a |
| 9 | 6 | 5 | a |
| 12 | 7 | 6 | a |
| 4 | 8 | 7 | a |
| 6 | 9 | 8 | a |
| 13 | 1 | 4 | b |
| 5 | 3 | 6 | b |
| 6 | 5 | 7 | b |
| 7 | 7 | 8 | b |
| 24 | 2 | 4 | c |
| 2 | 5 | 5 | c |
| 1 | 7 | 7 | c |
| 5 | 6 | 8 | c |
| 5 | 3 | 4 | d |
| 3 | 4 | 5 | d |
| 5 | 6 | 6 | d |
| 7 | 2 | 8 | d |
Desired output (see images for a better idea)
| score | attempts | time |
|-------|----------|------|
| 47 | 7 | 4 |
| 14 | 15 | 5 |
| 22 | 16 | 6 |
| 11 | 20 | 7 |
| 20 | 18 | 8 |
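For reference, one approach that should produce exactly this output, assuming a database with window functions (MySQL 8+, PostgreSQL, SQLite 3.25+), is to rank rows per time and keep only the top N before summing. The sketch below uses SQLite through Python's sqlite3 purely so it is runnable end to end; the table name records and the top-3 cut-off follow the example above:

import sqlite3

# (score, attempts, time, user) rows from the input table above
rows = [
    (10, 4, 4, 'a'), (9, 6, 5, 'a'), (12, 7, 6, 'a'), (4, 8, 7, 'a'), (6, 9, 8, 'a'),
    (13, 1, 4, 'b'), (5, 3, 6, 'b'), (6, 5, 7, 'b'), (7, 7, 8, 'b'),
    (24, 2, 4, 'c'), (2, 5, 5, 'c'), (1, 7, 7, 'c'), (5, 6, 8, 'c'),
    (5, 3, 4, 'd'), (3, 4, 5, 'd'), (5, 6, 6, 'd'), (7, 2, 8, 'd'),
]

con = sqlite3.connect(":memory:")
con.execute("create table records (score int, attempts int, time int, user text)")
con.executemany("insert into records values (?, ?, ?, ?)", rows)

# Rank people within each time by score, keep only the top 3, then sum as before
query = """
    select sum(score), sum(attempts), time
    from (
        select score, attempts, time,
               row_number() over (partition by time order by score desc) as rn
        from records
    )
    where rn <= 3
    group by time
    order by time asc
"""
for row in con.execute(query):
    print(row)   # (47, 7, 4), (14, 15, 5), (22, 16, 6), (11, 20, 7), (20, 18, 8)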
Related
I want to use pyspark StandardScaler on 6 out of 10 columns in my dataframe. This will be part of a pipeline.
The inputCol parameter seems to expect a vector, which I can pass in after using VectorAssembler on all my features, but this scales all 10 features. I don’t want to scale the other 4 features because they are binary and I want unstandardized coefficients for them.
Am I supposed to use vector assembler on the 6 features, scale them, then use vector assembler again on this scaled features vector and the remaining 4 features? I would end up with a vector within a vector and I’m not sure this will work.
What’s the right way to do this? An example is appreciated.
You can do this by using VectorAssembler. The key is that you have to extract the columns from the assembler output. See the code below for a working example:
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
import pandas as pd
import random

# Build a small random pandas frame and convert it to a Spark DataFrame
df = pd.DataFrame()
df['a'] = random.sample(range(100), 10)
df['b'] = random.sample(range(100), 10)
df['c'] = random.sample(range(100), 10)
df['d'] = random.sample(range(100), 10)
df['e'] = random.sample(range(100), 10)
sdf = spark.createDataFrame(df)  # 'spark' is the active SparkSession
sdf.show()
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 51| 13| 6| 5| 26|
| 18| 29| 19| 81| 28|
| 34| 1| 36| 57| 87|
| 56| 86| 51| 52| 48|
| 36| 49| 33| 15| 54|
| 87| 53| 47| 89| 85|
| 7| 14| 55| 13| 98|
| 70| 50| 32| 39| 58|
| 80| 20| 25| 54| 37|
| 40| 33| 44| 83| 27|
+---+---+---+---+---+
cols_to_scale = ['c', 'd', 'e']
cols_to_keep_unscaled = ['a', 'b']
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
assembler = VectorAssembler().setInputCols(cols_to_scale).setOutputCol("features")
sdf_transformed = assembler.transform(sdf)
scaler_model = scaler.fit(sdf_transformed.select("features"))
sdf_scaled = scaler_model.transform(sdf_transformed)
sdf_scaled.show()
+---+---+---+---+---+----------------+--------------------+
| a| b| c| d| e| features| scaledFeatures|
+---+---+---+---+---+----------------+--------------------+
| 51| 13| 6| 5| 26| [6.0,5.0,26.0]|[0.39358015146628...|
| 18| 29| 19| 81| 28|[19.0,81.0,28.0]|[1.24633714630991...|
| 34| 1| 36| 57| 87|[36.0,57.0,87.0]|[2.36148090879773...|
| 56| 86| 51| 52| 48|[51.0,52.0,48.0]|[3.34543128746345...|
| 36| 49| 33| 15| 54|[33.0,15.0,54.0]|[2.16469083306459...|
| 87| 53| 47| 89| 85|[47.0,89.0,85.0]|[3.08304451981926...|
| 7| 14| 55| 13| 98|[55.0,13.0,98.0]|[3.60781805510765...|
| 70| 50| 32| 39| 58|[32.0,39.0,58.0]|[2.09909414115354...|
| 80| 20| 25| 54| 37|[25.0,54.0,37.0]|[1.63991729777620...|
| 40| 33| 44| 83| 27|[44.0,83.0,27.0]|[2.88625444408612...|
+---+---+---+---+---+----------------+--------------------+
# Helper to flatten the scaled vector back into one column per feature
# (row.a and row.b correspond to cols_to_keep_unscaled)
def extract(row):
    return (row.a, row.b,) + tuple(row.scaledFeatures.toArray().tolist())
sdf_scaled = sdf_scaled.select(*cols_to_keep_unscaled, "scaledFeatures").rdd \
.map(extract).toDF(cols_to_keep_unscaled + cols_to_scale)
sdf_scaled.show()
+---+---+------------------+-------------------+------------------+
| a| b| c| d| e|
+---+---+------------------+-------------------+------------------+
| 51| 13|0.3935801514662892|0.16399957083190683|0.9667572801316145|
| 18| 29| 1.246337146309916| 2.656793047476891|1.0411232247571234|
| 34| 1|2.3614809087977355| 1.8695951074837378|3.2349185912096337|
| 56| 86|3.3454312874634584| 1.7055955366518312|1.7847826710122114|
| 36| 49| 2.164690833064591|0.49199871249572047| 2.007880504888738|
| 87| 53| 3.083044519819266| 2.9191923608079415|3.1605526465841245|
| 7| 14|3.6078180551076513| 0.4263988841629578| 3.643931286649932|
| 70| 50|2.0990941411535426| 1.2791966524888734|2.1566123941397555|
| 80| 20| 1.639917297776205| 1.7711953649845937| 1.375769975571913|
| 40| 33|2.8862544440861213| 2.7223928758096534| 1.003940252444369|
+---+---+------------------+-------------------+------------------+
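As a side note, on Spark 3.0 or later pyspark.ml.functions.vector_to_array can replace the RDD round-trip entirely. A rough sketch, assuming the same assembler/scaler objects and column lists as above, applied to the DataFrame that still carries the scaledFeatures vector column:

from pyspark.ml.functions import vector_to_array   # Spark >= 3.0
from pyspark.sql import functions as F

# Turn the vector into an array, then pull out one element per scaled column
sdf_flat = (scaler_model.transform(sdf_transformed)
            .withColumn("scaled", vector_to_array("scaledFeatures"))
            .select(*cols_to_keep_unscaled,
                    *[F.col("scaled")[i].alias(c) for i, c in enumerate(cols_to_scale)]))
sdf_flat.show()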
I am an R programmer moving into the pyspark world and have gotten a lot of the basic tricks down, but something I am still struggling with is the kind of thing I would use applys or basic for loops for.
In this case I am trying to calculate the "anti-groupby" for an ID. Basically the idea is to look at the population for that ID and then the population for everything that is not this ID, and have both those values on the same row. Getting the population for the ID itself is easy using a groupby and then joining it to a dataset with new_id as the only column.
This is how I would do it in R:
anti_group <- function(id){
  tr <- sum(subset(df1, new_id != id)$total_1)
  to <- sum(subset(df1, new_id != id)$total_2)
  54 * tr / to
}
test$other.RP54 <- sapply(test$new_id, anti_group)
How would I do it in pyspark?
Thanks!
Edit:
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
Then some function that creates a final dataframe that looks like this:
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
I think you can do that in two steps: first you sum by id, then you take the grand total and subtract the value for each id.
My idea is a little bit like a group_by(id) %>% summarise(x = sum(x)) %>% mutate(y = sum(x) - x) in dplyr.
The solution I propose is based on a Window function. It is untested:
Let's first create the data
import pyspark.sql.functions as psf
import pyspark.sql.window as psw
df = spark.createDataFrame([(1,40),(1,30),(2,10),(2,90),(3,20),(3,10),(4,2),(4,5)], ['id','value'])
df.show(2)
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
+---+-----+
only showing top 2 rows
and then apply that approach:
w = psw.Window.orderBy()
df_id = df.groupBy("id").agg(psf.sum("value").alias("grouped_total"))
df_id = (df_id
         .withColumn("anti_grouped_total", psf.sum("grouped_total").over(w))
         .withColumn("anti_grouped_total", psf.col("anti_grouped_total") - psf.col("grouped_total"))
        )
df_id.show(2)
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 3| 30| 177|
| 1| 70| 137|
+---+-------------+------------------+
only showing top 2 rows
So there's no built-in function that would replicate that anti-groupby, but you can easily do it by creating a new column using a case expression (when/otherwise) to mark your group and anti-group, and then grouping by that new column.
#df.show()
#sample data
+---+-----+
| id|value|
+---+-----+
| 1| 40|
| 1| 30|
| 2| 10|
| 2| 90|
| 3| 20|
| 3| 10|
| 4| 2|
| 4| 5|
+---+-----+
from pyspark.sql import functions as F
df.withColumn("anti_id_1", F.when(F.col("id")==1, F.lit('1')).otherwise(F.lit('Not_1')))\
.groupBy("anti_id_1").agg(F.sum("value").alias("sum")).show()
+---------+---+
|anti_id_1|sum|
+---------+---+
| 1| 70|
| Not_1|137|
+---------+---+
UPDATE:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w1 = Window().partitionBy("id")
w = Window().partitionBy()
df.withColumn("grouped_total", F.sum("value").over(w1))\
  .withColumn("anti_grouped_total", F.sum("value").over(w) - F.col("grouped_total"))\
  .groupBy("id").agg(F.first("grouped_total").alias("grouped_total"),
                     F.first("anti_grouped_total").alias("anti_grouped_total"))\
  .drop("value").orderBy("id").show()
+---+-------------+------------------+
| id|grouped_total|anti_grouped_total|
+---+-------------+------------------+
| 1| 70| 137|
| 2| 100| 107|
| 3| 30| 177|
| 4| 7| 200|
+---+-------------+------------------+
Less verbose/concise way to achieve the same output:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value").alias("grouped_total"))\
.withColumn("anti_grouped_total",F.sum("grouped_total").over(w)-F.col("grouped_total")).orderBy("id"),show()
For 2 value columns:
df.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
| 1| 40| 50|
| 1| 30| 70|
| 2| 10| 91|
| 2| 90| 21|
| 3| 20| 42|
| 3| 10| 4|
| 4| 2| 23|
| 4| 5| 12|
+---+------+------+
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window().partitionBy()
df.groupBy("id").agg(F.sum("value1").alias("grouped_total_1"),F.sum("value2").alias("grouped_total_2"))\
.withColumn("anti_grouped_total_1",F.sum("grouped_total_1").over(w)-F.col("grouped_total_1"))\
.withColumn("anti_grouped_total_2",F.sum("grouped_total_2").over(w)-F.col("grouped_total_2")).orderBy("id").show()
+---+---------------+---------------+--------------------+--------------------+
| id|grouped_total_1|grouped_total_2|anti_grouped_total_1|anti_grouped_total_2|
+---+---------------+---------------+--------------------+--------------------+
| 1| 70| 120| 137| 193|
| 2| 100| 112| 107| 201|
| 3| 30| 46| 177| 267|
| 4| 7| 35| 200| 278|
+---+---------------+---------------+--------------------+--------------------+
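One thing to keep in mind with the above: Window().partitionBy() with no columns moves every row into a single partition, which can become a bottleneck on large data. A join-based variant (a sketch, using the same df with id and value columns) gives the same two totals without an unpartitioned window:

from pyspark.sql import functions as F

totals = df.groupBy("id").agg(F.sum("value").alias("grouped_total"))
grand = totals.agg(F.sum("grouped_total").alias("grand_total"))   # one-row DataFrame

# Broadcast the single grand total onto every group and subtract
(totals.crossJoin(grand)
       .withColumn("anti_grouped_total", F.col("grand_total") - F.col("grouped_total"))
       .drop("grand_total")
       .orderBy("id").show())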
When executing "explain select ..." on SQLite, the execution plan is returned as a result. What do the columns mean?
The documentation simply says that the columns may change with each release. https://www.sqlite.org/eqp.html
Example:
| addr | opcode | p1 | p2 | p3 | p4 | p5 | comment |
|------|--------|----|----|----|----|----|---------|
| 0 | Init | 0 | 15 | 0 | | 00 | null |
| 1 | OpenRead | 0 | 5 | 0 | 7 | 00 | null |
| 2 | Variable | 1 | 1 | 0 | | 00 | null |
| 3 | SeekRowid | 0 | 14 | 1 | | 00 | null |
| 4 | Copy | 1 | 2 | 0 | | 00 | null |
This other part of the SQLite documentation suggests that they are operands of the SQLite bytecode engine. I had been hoping that they were estimates of execution time.
"Each instruction has an opcode and five operands named P1, P2 P3, P4, and P5."
https://www.sqlite.org/opcode.html
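For reference, both forms are easy to inspect from Python's built-in sqlite3 module: plain EXPLAIN returns one row per bytecode instruction (the addr/opcode/p1..p5/comment columns shown above), while EXPLAIN QUERY PLAN returns the higher-level plan that the eqp.html page documents. A small sketch against a throwaway table:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")

# EXPLAIN: the bytecode program; 'comment' is NULL unless SQLite was built
# with SQLITE_ENABLE_EXPLAIN_COMMENTS
cur = con.execute("EXPLAIN SELECT name FROM t WHERE id = 1")
print([col[0] for col in cur.description])   # column names of the dump
for row in cur:
    print(row)

# EXPLAIN QUERY PLAN: the human-readable plan, not per-instruction operands
for row in con.execute("EXPLAIN QUERY PLAN SELECT name FROM t WHERE id = 1"):
    print(row)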
I am currently writing a meta-analysis using a pairwise random-effects meta-analysis to compare the complication rates of 5 treatment modalities, with one as the gold standard. I was able to get RRs out of the analysis but was not able to get p-values out of it. How do I get the p-values from this model? I tried to use pval.random but that didn't work, and I couldn't find any other code on CRAN that would help me. Can someone help me with the R code?
drf <- read.csv("drf zonder moroni.csv", sep = ";", header = TRUE, as.is = TRUE)
##
drf <- drf[, 1:5]
names(drf) <- c("study", "type", "treat", "events", "n")
compl <- subset(drf, type == "Complications")
library(netmeta)
p.compl <- pairwise(treat = treat, event = events, n = n,
                    studlab = study, data = compl)
n.compl <- netmeta(p.compl, reference = "PC", comb.random = TRUE)
n.compl
netgraph(n.compl, iterate = TRUE, number = TRUE)
Part of the dataset:
| Study | Event Type | Treatment | Number of Events (n) | N | n/N |
|-------|------------|-----------|----------------------|---|-----|
| Kumaravel | Complications | EF | 3 | 23 | 0,1304348 |
| Franck | Complications | EF | 2 | 20 | 0,1 |
| Schonnemann | Complications | EF | 8 | 30 | 0,2666667 |
| Aita | Complications | EF | 1 | 16 | 0,0625 |
| Hove | Complications | EF | 31 | 39 | 0,7948718 |
| Andersen | Complications | EF | 26 | 75 | 0,3466667 |
| Krughaug | Complications | EF | 22 | 75 | 0,2933333 |
| Moroni | Complications | EF | 0 | 20 | 0 |
| Plate | Complications | IMN | 3 | 30 | 0,1 |
| Chappuis | Complications | IMN | 4 | 16 | 0,25 |
| Gradl | Complications | IMN | 12 | 66 | 0,1818182 |
| Schonnemann | Complications | IMN | 6 | 31 | 0,1935484 |
| Aita | Complications | IMN | 1 | 16 | 0,0625 |
| Dremstrop | Complications | IMN | 17 | 44 | 0,3863636 |
| Wong | Complications | PC | 1 | 30 | 0,0333333 |
| Kumaravel | Complications | PC | 4 | 25 | 0,16 |
P-values can be retrieved with n.compl$pval.random.
I want to create a matrix where the result shows the difference between columns. I have used sapply, which visually returns the correct results, but as a list. I have a one-row data frame that is all numeric.
Example data:
| A | B | C |
|---|---|---|
| 12 | 6 | 7 |
I want to return
|   | A | B | C |
|---|---|---|---|
| A | 0 | 6 | 5 |
| B | 6 | 0 | 1 |
| C | 5 | 1 | 0 |
I have tried
mymatrix <- sapply(DF, function(x) abs(x - DF))
This returns a list that looks like a matrix and gives the correct information, but I need it as a matrix, not a list.