Is there an elegant pyspark solution for the following ranking problem?

Is there an elegant function that already exists for the following problem?
I've been tasked with creating a function that computes differences in days and ranks the values. The closest non-negative number ranks as 0 and is the 'starting point'. Moving away from the starting point, negative differences receive negative ranks and non-negative differences receive non-negative ranks, as in the table below.
| Datediff() | Rank |
| ---------- | ---- |
| -50        | -3   |
| -32        | -2   |
| -1         | -1   |
| 5          | 0    |
| 14         | 1    |
| 32         | 2    |
| 128        | 3    |
| 254        | 4    |
My solution so far would be to separate the negative and non-negative numbers and use Window.partitionBy() to assign the corresponding rank within each group (a sketch of that idea follows). It would work, but I'm curious whether there is a more elegant solution. :)
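For reference, a minimal sketch of that two-group idea might look like the following (my own assumption of how the partitionBy approach would be wired up, reusing the sample values from the table above and a hypothetical dated_diff column name):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Hypothetical input: one column of day differences (values from the table above).
df = spark.createDataFrame([(-50,), (-32,), (-1,), (5,), (14,), (32,), (128,), (254,)], ["dated_diff"])
# Split rows into a negative and a non-negative group.
flagged = df.withColumn("is_neg", F.col("dated_diff") < 0)
# Negatives are ranked -1, -2, ... starting from the value closest to zero;
# non-negatives are ranked 0, 1, 2, ... starting from the smallest value.
w_neg = Window.partitionBy("is_neg").orderBy(F.col("dated_diff").desc())
w_pos = Window.partitionBy("is_neg").orderBy(F.col("dated_diff").asc())
ranked = flagged.withColumn(
    "rank",
    F.when(F.col("is_neg"), -F.row_number().over(w_neg))
     .otherwise(F.row_number().over(w_pos) - 1)
).drop("is_neg")
ranked.orderBy("dated_diff").show()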

You can use Window to generate serial numbers to use as rank:
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window
df = spark.createDataFrame([(10,),(-10,),(5,),(-5,),(15,),(-15,)], ["dated_diff"])
window = Window.orderBy(df["dated_diff"])
df = df.select("dated_diff", row_number().over(window).alias("row_number"))
df.show()
+----------+----------+
|dated_diff|row_number|
+----------+----------+
| -15| 1|
| -10| 2|
| -5| 3|
| 5| 4|
| 10| 5|
| 15| 6|
+----------+----------+
Then find the rank of the first non-negative number:
first_positive_rank = df.filter("dated_diff>=0").first()["row_number"]
print(first_positive_rank)
>> 4
OR
first_positive_rank = df.filter("dated_diff<0").count() + 1
print(first_positive_rank)
>> 4
And finally subtract that rank from the rank of every row:
df = df.withColumn("row_number", col("row_number") - first_positive_rank)
df.show()
+----------+----------+
|dated_diff|row_number|
+----------+----------+
| -15| -3|
| -10| -2|
| -5| -1|
| 5| 0|
| 10| 1|
| 15| 2|
+----------+----------+
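If you would rather avoid collecting first_positive_rank on the driver, a possible single-select variant is sketched below (my own suggestion, not tested at scale; like the Window.orderBy above, the global aggregate pulls all rows into one partition, so Spark will warn about it):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(10,), (-10,), (5,), (-5,), (15,), (-15,)], ["dated_diff"])
order_window = Window.orderBy("dated_diff")
all_rows_window = Window.partitionBy()  # one partition spanning the whole DataFrame
df = df.withColumn(
    "rank",
    F.row_number().over(order_window)
    - F.sum(F.when(F.col("dated_diff") < 0, 1).otherwise(0)).over(all_rows_window)  # count of negatives
    - 1
)
df.orderBy("dated_diff").show()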

Related

Dictionary-style conditional referencing between two dataframes

I have two data frames. DF1 is a list of homicides with a date and location attached to each row. DF2 consists of a set of shared locations mentioned in DF1.
DF2 contains a latitude and longitude for each unique location. I want to pull these out. NOTE: DF2 contains shared locations, which may correspond to multiple homicides in DF1, which means the two DFs are different lengths.
I want to create latitude and longitude variables in DF1 wherever a location in DF2 matches the location in DF1 (assuming location names are identical between the two DFs). How do I pull the latitude and longitude from DF2 where the location in DF2 corresponds to a given homicide record in DF1?
Small reproducible example:
DF1: (dataframe of incidents)
| Incident | Place |
| -------- | -------|
| Incident 1| Place 1|
| Incident 2| Place 2|
| Incident 3| Place 2|
| Incident 4| Place 3|
| Incident 5| Place 1|
| Incident 6| Place 3|
| Incident 7| Place 2|
DF2: (dictionary-style lat-lon manual)
| Place |Latitude |Longitude |
| -------| ------- | ---------|
| Place 1| A | B |
| Place 2| C | D |
| Place 3| E | F |
| Place 4| G | H |
DF3 (what I want)
| Incident | Latitude | Longitude |
| -------- | -------- | --------- |
|Incident 1| A | B |
|Incident 2| C | D |
|Incident 3| C | D |
|Incident 4| E | F |
|Incident 5| A | B |
|Incident 6| E | F |
|Incident 7| C | D |
I have tried:
DF1$latitude <- DF2$latitude[which(DF2$location == DF1$location), ]
It returned the following error:
Error in DF2$latitude[which(DF2$location == DF1$location), ] :
incorrect number of dimensions
In addition: Warning message:
In DF2$location == DF1$location :
longer object length is not a multiple of shorter object length
In response to a comment suggestion, I also tried:
DF2$latitude[which(DF2$location == DF1$location)]
However, I got the error:
Error in `$<-.data.frame`(`*tmp*`, latitude, value = numeric(0)) :
replacement has 0 rows, data has 1220
In addition: Warning message:
In DF1$location == DF2$location :
longer object length is not a multiple of shorter object length
You can try dplyr's left_join(). The code below keeps all the rows in DF1 and adds the variables from DF2 wherever it finds a match based on location.
library(dplyr)
DF3 <- left_join(DF1, DF2, by = "location")

Count rows in a dataframe object with criteria in R

Okay, I have a bit of a noob question, so please excuse me. I have a data frame object as follows:
| order_id| department_id|department | n|
|--------:|-------------:|:-------------|--:|
| 1| 4|produce | 4|
| 1| 15|canned goods | 1|
| 1| 16|dairy eggs | 3|
| 36| 4|produce | 3|
| 36| 7|beverages | 1|
| 36| 16|dairy eggs | 3|
| 36| 20|deli | 1|
| 38| 1|frozen | 1|
| 38| 4|produce | 6|
| 38| 13|pantry | 1|
| 38| 19|snacks | 1|
| 96| 1|frozen | 2|
| 96| 4|produce | 4|
| 96| 20|deli | 1|
This is the code I've used to arrive at this object:
temp5 <- opt %>%
  left_join(products, by = "product_id") %>%
  left_join(departments, by = "department_id") %>%
  group_by(order_id, department_id, department) %>%
  tally() %>%
  group_by(department_id)
kable(head(temp5, 14))
As you can see, the object lists the departments present in each order_id. What I want to do now is count the number of departments for each order_id.
I tried using the summarise() function from the dplyr package, but it throws the following error:
Error in summarise_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "factor".
It seems so simple, but I can't figure out how to do it. Any help will be appreciated.
Edit: This is the code that I tried to run. Afterwards I read about the count() function in the plyr package and tried that as well, but it is of no use here, since it needs a data frame as input, whereas I only want to count the number of occurrences within the data frame:
temp5 <- opt %>%
  left_join(products, by = "product_id") %>%
  left_join(departments, by = "department_id") %>%
  group_by(order_id, department_id, department) %>%
  tally() %>%
  group_by(department_id) %>%
  summarise(count(department))
In the output, I need to know the average number of departments ordered from in each order_id, so I need something like this:
| order_id | no. of departments |
|---------:|-------------------:|
|        1 |                  3 |
|       36 |                  4 |
|       38 |                  4 |
|       96 |                  3 |
Then I should be able to use ggplot to plot the number of orders vs. the number of departments in each order. I hope this is clear.

Dividing the time into periods each 30 min

I have a DataFrame that contains a "time" column. I want to add a new column containing a period number, obtained by dividing each day into 30-minute periods.
For example, the original DataFrame:
l = [('A','2017-01-13 00:30:00'),('A','2017-01-13 00:00:01'),('E','2017-01-13 14:00:00'),('E','2017-01-13 12:08:15')]
df = spark.createDataFrame(l,['test','time'])
df1 = df.select(df.test,df.time.cast('timestamp'))
df1.show()
+----+-------------------+
|test| time|
+----+-------------------+
| A|2017-01-13 00:30:00|
| A|2017-01-13 00:00:01|
| E|2017-01-13 14:00:00|
| E|2017-01-13 12:08:15|
+----+-------------------+
The desired DataFrame would be as follows:
+----+-------------------+------+
|test| time|period|
+----+-------------------+------+
| A|2017-01-13 00:30:00| 2|
| A|2017-01-13 00:00:01| 1|
| E|2017-01-13 14:00:00| 29|
| E|2017-01-13 12:08:15| 25|
+----+-------------------+------+
Are there ways to achieve that?
You can simply use the built-in hour and minute functions, together with the built-in when function, to get your final result:
from pyspark.sql import functions as F
df1.withColumn('period', (F.hour(df1['time'])*2)+1+(F.when(F.minute(df1['time']) >= 30, 1).otherwise(0))).show(truncate=False)
You should be getting
+----+---------------------+------+
|test|time |period|
+----+---------------------+------+
|A |2017-01-13 00:30:00.0|2 |
|A |2017-01-13 00:00:01.0|1 |
|E |2017-01-13 14:00:00.0|29 |
|E |2017-01-13 12:08:15.0|25 |
+----+---------------------+------+
I hope the answer is helpful
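As a hedged variation on the same idea (my own sketch, assuming periods are numbered from 1 within each day): computing the period from the second of the day makes the period length easy to change later.
from pyspark.sql import functions as F
period_minutes = 30  # assumed period length
df2 = df1.withColumn(
    "period",
    F.floor(
        (F.hour("time") * 3600 + F.minute("time") * 60 + F.second("time"))
        / (period_minutes * 60)
    ) + 1
)
df2.show(truncate=False)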

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in Excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate defect rate (which would be Defective Units/Total Units) and I want to aggregate by either of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't seem to work and summed up defect rates (i.e., sum(Defective_Units/Total_Units)), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving me defect rates greater than 1. Is there any way I can declare a calculated field that is just a formula evaluated after aggregation?
You're lucky - the creator of pivottable.js foresaw cases like yours (and mine, earlier today) by implementing an aggregator called "Sum over Sum", along with a few similar ones; cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll use "Sum over Sum" as the "aggregatorName" parameter and pass the columns whose quotient we want in the "vals" parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars, rows = "gear", cols = c("cyl", "carb"),
            aggregatorName = "Sum over Sum",
            vals = c("mpg", "disp"),
            width = "100%", height = "400px")

How can I calculate the correlation of my residuals? Spark-Scala

I need to know whether my residuals are correlated or not. I didn't find a way to do it using Spark-Scala on Databricks,
and I concluded that I should export my project to R to use the acf function.
Does someone know a trick to do it using Spark-Scala on Databricks?
For those who need more information: I'm currently working on sales forecasting. I used a regression forest with different features, and now I need to evaluate the quality of my forecast. To check this, I read in a paper that residuals are a good way to see whether your forecasting model is good or bad. In any case you can still improve it, but it's just to form my opinion on my forecast model and compare it to other models.
Currently, I have a dataframe like the one below. It's part of the testing/out-of-sample data. (I cast prediction and residuals to IntegerType, which is why in the 3rd row 40 - 17 = 22.)
I am using Spark 2.1.1.
You can find the correlation between columns using Spark's MLlib and SQL functions.
Let's first import the classes:
import org.apache.spark.sql.functions.corr
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
Now prepare the input DataFrame:
scala> val seqRow = Seq(
| ("2017-04-27",13,21),
| ("2017-04-26",7,16),
| ("2017-04-25",40,17),
| ("2017-04-24",17,17),
| ("2017-04-21",10,20),
| ("2017-04-20",9,19),
| ("2017-04-19",30,30),
| ("2017-04-18",18,25),
| ("2017-04-14",32,28),
| ("2017-04-13",39,18),
| ("2017-04-12",2,4),
| ("2017-04-11",8,24),
| ("2017-04-10",18,27),
| ("2017-04-07",6,17),
| ("2017-04-06",13,29),
| ("2017-04-05",10,17),
| ("2017-04-04",6,8),
| ("2017-04-03",20,32)
| )
seqRow: Seq[(String, Int, Int)] = List((2017-04-27,13,21), (2017-04-26,7,16), (2017-04-25,40,17), (2017-04-24,17,17), (2017-04-21,10,20), (2017-04-20,9,19), (2017-04-19,30,30), (2017-04-18,18,25), (2017-04-14,32,28), (2017-04-13,39,18), (2017-04-12,2,4), (2017-04-11,8,24), (2017-04-10,18,27), (2017-04-07,6,17), (2017-04-06,13,29), (2017-04-05,10,17), (2017-04-04,6,8), (2017-04-03,20,32))
scala> val rdd = sc.parallelize(seqRow)
rdd: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[51] at parallelize at <console>:34
scala> val input_df = spark.createDataFrame(rdd).toDF("date","amount","prediction").withColumn("residuals",'amount - 'prediction)
input_df: org.apache.spark.sql.DataFrame = [date: string, amount: int ... 2 more fields]
scala> input_df.show(false)
+----------+------+----------+---------+
|date |amount|prediction|residuals|
+----------+------+----------+---------+
|2017-04-27|13 |21 |-8 |
|2017-04-26|7 |16 |-9 |
|2017-04-25|40 |17 |23 |
|2017-04-24|17 |17 |0 |
|2017-04-21|10 |20 |-10 |
|2017-04-20|9 |19 |-10 |
|2017-04-19|30 |30 |0 |
|2017-04-18|18 |25 |-7 |
|2017-04-14|32 |28 |4 |
|2017-04-13|39 |18 |21 |
|2017-04-12|2 |4 |-2 |
|2017-04-11|8 |24 |-16 |
|2017-04-10|18 |27 |-9 |
|2017-04-07|6 |17 |-11 |
|2017-04-06|13 |29 |-16 |
|2017-04-05|10 |17 |-7 |
|2017-04-04|6 |8 |-2 |
|2017-04-03|20 |32 |-12 |
+----------+------+----------+---------+
The residual values for the rows 2017-04-14 and 2017-04-13 don't match yours because I am computing residuals as amount - prediction.
Now let's proceed to calculating the correlation between all the columns.
This method is useful when you have many columns and need the correlation of each column with every other column.
First we drop the column whose correlation is not to be calculated:
scala> val drop_date_df = input_df.drop('date)
drop_date_df: org.apache.spark.sql.DataFrame = [amount: int, prediction: int ... 1 more field]
scala> drop_date_df.show
+------+----------+---------+
|amount|prediction|residuals|
+------+----------+---------+
| 13| 21| -8|
| 7| 16| -9|
| 40| 17| 23|
| 17| 17| 0|
| 10| 20| -10|
| 9| 19| -10|
| 30| 30| 0|
| 18| 25| -7|
| 32| 28| 4|
| 39| 18| 21|
| 2| 4| -2|
| 8| 24| -16|
| 18| 27| -9|
| 6| 17| -11|
| 13| 29| -16|
| 10| 17| -7|
| 6| 8| -2|
| 20| 32| -12|
+------+----------+---------+
Since there are more than 2 columns to correlate, we need a correlation matrix.
To calculate the correlation matrix we need an RDD[Vector], as you can see in the Spark example for correlation:
scala> val dense_rdd = drop_date_df.rdd.map{row =>
| val first = row.getAs[Integer]("amount")
| val second = row.getAs[Integer]("prediction")
| val third = row.getAs[Integer]("residuals")
| Vectors.dense(first.toDouble,second.toDouble,third.toDouble)}
dense_rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[62] at map at <console>:40
scala> val correlMatrix: Matrix = Statistics.corr(dense_rdd, "pearson")
correlMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0 0.40467032516705076 0.782939330961529
0.40467032516705076 1.0 -0.2520531290688281
0.782939330961529 -0.2520531290688281 1.0
The order of the columns remains the same, but you lose the column names.
You can find good resources about the structure of a correlation matrix.
Since you only want the correlation of the residuals with the other two columns, we can explore other options.
Hive corr UDAF:
scala> drop_date_df.createOrReplaceTempView("temp_table")
scala> val corr_query_df = spark.sql("select corr(amount,residuals) as amount_residuals_corr,corr(prediction,residuals) as prediction_residual_corr from temp_table")
corr_query_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_query_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+
Spark corr function:
scala> val corr_df = drop_date_df.select(
| corr('amount,'residuals).as("amount_residuals_corr"),
| corr('prediction,'residuals).as("prediction_residual_corr"))
corr_df: org.apache.spark.sql.DataFrame = [amount_residuals_corr: double, prediction_residual_corr: double]
scala> corr_df.show
+---------------------+------------------------+
|amount_residuals_corr|prediction_residual_corr|
+---------------------+------------------------+
| 0.7829393309615287| -0.252053129068828|
+---------------------+------------------------+
