Get sequence column from recursive hierarchy in PySpark

I am currently trying to create a column that reflects a sequence from a recursive hierarchy in PySpark. This is what the data looks like:
data = [(1,"A",None),(1,"B","A"),(1,"C","A"),(1,"D","C"),(1,"E","B"),(2,"A",None),(2,"B",None),(2,"C","A"),(2,"D","A"),(2,"E","D")]
df = spark.createDataFrame(data, "ID integer, Child string, Parent string")
+---+-----+------+
| ID|Child|Parent|
+---+-----+------+
|  1|    A|  null|
|  1|    B|     A|
|  1|    C|     A|
|  1|    D|     C|
|  1|    E|     B|
|  2|    A|  null|
|  2|    B|  null|
|  2|    C|     A|
|  2|    D|     A|
|  2|    E|     D|
+---+-----+------+
The expected result:
+---+-----+------+--------+
| ID|Child|Parent|Sequence|
+---+-----+------+--------+
|  1|    A|  null|       1|
|  1|    B|     A|       2|
|  1|    C|     A|       2|
|  1|    D|     C|       3|
|  1|    E|     B|       3|
|  2|    A|  null|       1|
|  2|    B|  null|       0|
|  2|    C|     A|       2|
|  2|    D|     A|       2|
|  2|    E|     D|       3|
+---+-----+------+--------+
What would be the best way to approach this?
I am aware that in SQL you can do this with a recursive CTE, but as far as I can tell there is no equivalent in PySpark.
Recursively joining DataFrames seems to be the way to accomplish this, but it seems expensive and overly complex.
Is there a more native/efficient way to accomplish this?
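To make the expected numbering concrete, here is a minimal plain-Python sketch (not Spark code) of the level-by-level logic that an iterative join would have to reproduce. The function name add_sequence is hypothetical. It assumes each (ID, Child) pair is unique, and it assumes that a root with no children gets 0, which is inferred only from row (2, B) in the expected result.

```python
# Rows copied from the question: (ID, Child, Parent)
data = [(1, "A", None), (1, "B", "A"), (1, "C", "A"), (1, "D", "C"), (1, "E", "B"),
        (2, "A", None), (2, "B", None), (2, "C", "A"), (2, "D", "A"), (2, "E", "D")]


def add_sequence(rows):
    """Return (ID, Child, Parent, Sequence) tuples, resolved one level per pass."""
    # Nodes that appear as somebody's parent within the same ID.
    has_child = {(i, p) for i, _, p in rows if p is not None}

    # Roots (no parent) start at 1.
    # ASSUMPTION: a root without children gets 0, as row (2, B) suggests.
    level = {}
    for i, c, p in rows:
        if p is None:
            level[(i, c)] = 1 if (i, c) in has_child else 0

    # Each pass resolves the children of already-resolved parents.
    pending = [(i, c, p) for i, c, p in rows if p is not None]
    while pending:
        resolved = [(i, c, p) for i, c, p in pending if (i, p) in level]
        if not resolved:
            raise ValueError("cycle or missing parent in hierarchy")
        for i, c, p in resolved:
            level[(i, c)] = level[(i, p)] + 1
        pending = [(i, c, p) for i, c, p in pending if (i, c) not in level]

    return [(i, c, p, level[(i, c)]) for i, c, p in rows]


result = add_sequence(data)
for row in result:
    print(row)
```

Each pass of the while loop corresponds to one join of the unresolved rows against the rows resolved so far, so the number of passes equals the depth of the deepest hierarchy rather than the number of rows.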

Related

Comparing each component in two matrices with a for statement

I am comparing each component in two matrices and collecting the matching pairs of dimnames.
Below are example matrices:
mat1
| | D1| D2| D3| D4|
| D1 | NA| 1| 2| 2|
| D2 | NA| NA| 3| 8|
| D3 | NA| NA| NA| 8|
| D4 | NA| NA| NA| NA|
mat2
| | D1| D2| D3| D4|
| D1 | NA| 2| 4| 1|
| D2 | NA| NA| 13| 10|
| D3 | NA| NA| NA| 5|
| D4 | NA| NA| NA| NA|
I tried to compare them using a "for" loop and collect the pairs with the code below.
check = c()
for(i in 2:4){
  for(j in 1:i-1){
    if(mat1[j,i] < mat2[j,i]){
      check = append(check, 100*j + i)
    }
  }
}
But it produces this error message:
Error in if (mat1[j, i] < mat2[j, i]) { :
  argument is of length zero
I understand that the error comes from here:
> mat1[j, i] < mat2[j, i]
logical(0)
But I don't know how to solve it.
Any other approaches to solving this problem would also be super helpful. Thank you!

Repeat row of data table N times and join to another table in R

I have two data tables:
> dt1
+---+---+
| V1| V2|
+---+---+
| a| a|
| a| b|
| a| c|
| b| c|
| c| d|
+---+---+
> dt2
+------------------+
| id|
+------------------+
| c(4, 98, 56, 32)|
+------------------+
dt2 has the following structure:
> str(dt2)
Classes ‘data.table’ and 'data.frame': 1 obs. of 1 variable:
$ id:List of 1
..$ : int 4 98 56 32
- attr(*, ".internal.selfref")=<externalptr>
I am looking for the most efficient way (if possible, based on the data.table approach) to add the column value from dt2 as a constant value to each row of dt1.
Expected result:
+---+---+------------------+
| V1| V2| id|
+---+---+------------------+
| a| a| c(4, 98, 56, 32)|
| a| b| c(4, 98, 56, 32)|
| a| c| c(4, 98, 56, 32)|
| b| c| c(4, 98, 56, 32)|
| c| d| c(4, 98, 56, 32)|
+---+---+------------------+
Do you mean something like this?
dt1[, id:=dt2$id]
Output:
|V1 |V2 |id |
|:--|:--|:-------------|
|a |a |c(4,98,56,32) |
|a |b |c(4,98,56,32) |
|a |c |c(4,98,56,32) |
|b |c |c(4,98,56,32) |
|c |d |c(4,98,56,32) |

Change outliers from black to colour in grouped box plot in ggplot2

I have a grouped box plot in which I want to change the outlier dots from the default black to the colour of the boxes, keeping everything else the same. A previous thread, "Coloring boxplot outlier points in ggplot2?", provides a solution for a standard box plot, which I am able to implement.
However, I want to do it for a grouped box plot.
Below is some example data and code for the grouped box plot.
|ID |Time |Metabolite | Concentration|
|:--|:----|:----------|-------------:|
|1 |1 |A | 40|
|1 |1 |B | 36|
|1 |1 |C | 28|
|1 |2 |A | 13|
|1 |2 |B | 150|
|1 |2 |C | 32|
|1 |3 |A | 45|
|1 |3 |B | 15|
|1 |3 |C | 15|
|2 |1 |A | 7|
|2 |1 |A | 9|
|2 |1 |B | 236|
|2 |1 |C | 33|
|2 |2 |A | 33|
|2 |2 |B | 48|
|2 |2 |C | 39|
|2 |3 |A | 15|
|2 |3 |C | 126|
|3 |1 |A | 13|
|3 |1 |B | 41|
|3 |1 |C | 37|
|3 |2 |A | 3|
|3 |2 |B | 218|
|3 |2 |C | 27|
|3 |3 |A | 7|
|3 |3 |B | 27|
|3 |3 |C | 3|
|4 |1 |A | 4|
|4 |1 |B | 7|
|4 |1 |C | 33|
|4 |2 |A | 133|
|4 |2 |B | 4|
|4 |2 |C | 10|
|4 |3 |A | 122|
|4 |3 |B | 27|
|4 |3 |C | 14|
|5 |1 |A | 7|
|5 |1 |B | 22|
|5 |1 |C | 43|
|5 |2 |A | 3|
|5 |2 |B | 6|
|5 |2 |C | 158|
|5 |3 |A | 48|
|5 |3 |B | 7|
|5 |3 |C | 24|
|6 |1 |A | 15|
|6 |1 |B | 30|
|6 |1 |C | 15|
|6 |2 |A | 27|
|6 |2 |B | 187|
|6 |2 |C | 9|
|6 |3 |A | 31|
|6 |3 |B | 40|
|6 |3 |C | 41|
|7 |1 |A | 37|
|7 |1 |B | 30|
|7 |1 |C | 28|
|7 |2 |A | 142|
|7 |2 |B | 40|
|7 |2 |C | 7|
|7 |3 |A | 45|
|7 |3 |B | 3|
|8 |3 |C | 45|
|8 |1 |A | 34|
|8 |1 |B | 8|
|8 |1 |C | 46|
|8 |2 |A | 167|
|8 |2 |B | 25|
|8 |2 |C | 34|
|8 |3 |A | 27|
|9 |3 |B | 28|
|9 |3 |C | 36|
|9 |1 |A | 44|
|9 |1 |B | 26|
|9 |1 |C | 20|
|9 |2 |A | 11|
|9 |2 |B | 18|
|9 |2 |C | 176|
|9 |3 |A | 1|
|9 |3 |B | 40|
|9 |3 |C | 10|
|10 |1 |A | 8|
|10 |1 |B | 49|
|10 |1 |C | 193|
|10 |2 |A | 13|
|10 |2 |B | 13|
|10 |2 |C | 28|
|10 |3 |A | 50|
|10 |3 |B | 47|
|10 |3 |C | 46|
|11 |1 |A | 21|
|11 |1 |B | 34|
|11 |1 |C | 28|
|11 |2 |A | 13|
|11 |2 |B | 32|
|11 |2 |C | 47|
|11 |3 |A | 15|
|11 |3 |B | 42|
|11 |3 |C | 9|
ggplot(df, aes(x=Time, y=Concentration, fill=Metabolite)) +
geom_boxplot()

r parser translating symbol_function_call as a symbol

If I parse do.call(what=knitr::kable,args=args), the function kable in do.call is parsed as a SYMBOL and not as a SYMBOL_FUNCTION_CALL.
Why shouldn't it be the latter?
tf <- tempfile()
cat('do.call(knitr::kable,args=args)',file = tf)
parsed <- utils::getParseData(parse(tf))
knitr::kable(parsed)
| | line1| col1| line2| col2| id| parent|token |terminal |text |
|:--|-----:|----:|-----:|----:|--:|------:|:--------------------|:--------|:-------|
|18 | 1| 1| 1| 31| 18| 0|expr |FALSE | |
|1 | 1| 1| 1| 7| 1| 3|SYMBOL_FUNCTION_CALL |TRUE |do.call |
|3 | 1| 1| 1| 7| 3| 18|expr |FALSE | |
|2 | 1| 8| 1| 8| 2| 18|'(' |TRUE |( |
|7 | 1| 9| 1| 20| 7| 18|expr |FALSE | |
|4 | 1| 9| 1| 13| 4| 7|SYMBOL_PACKAGE |TRUE |knitr |
|5 | 1| 14| 1| 15| 5| 7|NS_GET |TRUE |:: |
|6 | 1| 16| 1| 20| 6| 7|SYMBOL |TRUE |kable |
|8 | 1| 21| 1| 21| 8| 18|',' |TRUE |, |
|11 | 1| 22| 1| 25| 11| 18|SYMBOL_SUB |TRUE |args |
|12 | 1| 26| 1| 26| 12| 18|EQ_SUB |TRUE |= |
|13 | 1| 27| 1| 30| 13| 15|SYMBOL |TRUE |args |
|15 | 1| 27| 1| 30| 15| 18|expr |FALSE | |
|14 | 1| 31| 1| 31| 14| 18|')' |TRUE |) |
If you just have kable, it's a symbol. That symbol could point to a function or a value. It's not clear what it is until you actually evaluate it.
However, if you have kable(), it's clear that you expect kable to be a function and that you are calling it.
The do.call hides from the parser the fact that you are trying to call a function; that intention isn't realized until run time.
Things can get funny if you do something like
sum <- 5
sum
# [1] 5
sum(1:3)
# [1] 6
Here sum is behaving both like a regular variable and a function. We've actually created a shadow variable in our global environment that masks the sum function from base. But because the parser treats sum and sum() differently, we can still get at both meanings.

Importing data from log to SQL table with delimiter

I have a SQL table with the columns below and want to import the log data into it, while removing the hyphens under the column headers and the trailing delimiter. Please help.
timestamp |cp |type |count |fail_count |succ |fail |
-------------------|---|--------|-----------|-----------|-----------|-----------|
2014.12.15 00:00:00| 1| 5| 5| 0| 143| 0|
2014.12.15 01:00:00| 1| 5| 30| 0| 945| 0|
2014.12.15 02:00:00| 1| 5| 30| 0| 1055| 0|
2014.12.15 03:00:00| 1| 5| 24| 0| 816| 0|
2014.12.15 04:00:00| 1| 5| 28| 0| 882| 0|
2014.12.15 05:00:00| 1| 5| 6| 0| 155| 0|
2014.12.15 06:00:00| 1| 5| 12| 0| 236| 0|
