I have a SQL table with the columns below and want to import this data into it, while removing the row of hyphens under the column headers and the trailing delimiter on each line. Please help.
timestamp |cp |type |count |fail_count |succ |fail |
-------------------|---|--------|-----------|-----------|-----------|-----------|
2014.12.15 00:00:00| 1| 5| 5| 0| 143| 0|
2014.12.15 01:00:00| 1| 5| 30| 0| 945| 0|
2014.12.15 02:00:00| 1| 5| 30| 0| 1055| 0|
2014.12.15 03:00:00| 1| 5| 24| 0| 816| 0|
2014.12.15 04:00:00| 1| 5| 28| 0| 882| 0|
2014.12.15 05:00:00| 1| 5| 6| 0| 155| 0|
2014.12.15 06:00:00| 1| 5| 12| 0| 236| 0|
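One possible way to pre-process the data before loading it, sketched in Python. This assumes the data sits in a pipe-delimited text layout exactly as shown above; the names `RAW` and `clean_rows` are hypothetical, and the actual load step depends on the database, so it is left out. The sketch skips the hyphen row, drops the trailing delimiter, and trims each field.

```python
import csv
import io

# A shortened copy of the sample above (hypothetical variable name).
RAW = """\
timestamp |cp |type |count |fail_count |succ |fail |
-------------------|---|--------|-----------|-----------|-----------|-----------|
2014.12.15 00:00:00| 1| 5| 5| 0| 143| 0|
2014.12.15 01:00:00| 1| 5| 30| 0| 945| 0|
"""


def clean_rows(lines):
    """Yield lists of trimmed fields, skipping the hyphen separator row."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The separator row consists only of hyphens and pipes.
        if set(line) <= {"-", "|"}:
            continue
        # Drop the trailing delimiter so no empty last column is produced.
        if line.endswith("|"):
            line = line[:-1]
        yield [field.strip() for field in line.split("|")]


rows = list(clean_rows(RAW.splitlines()))
header, records = rows[0], rows[1:]

# Write a clean CSV that a bulk loader could read.
out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

The resulting CSV has one header row and one line per record, which most bulk-import tools accept directly.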
I am currently trying to create a column that reflects a sequence from a recursive hierarchy in PySpark. This is what the data looks like.
data = [(1,"A",None),(1,"B","A"),(1,"C","A"),(1,"D","C"),(1,"E","B"),(2,"A",None),(2,"B",None),(2,"C","A"),(2,"D","A"),(2,"E","D")]
df = spark.createDataFrame(data, "ID integer, Child string, Parent string")
+---+-----+------+
| ID|Child|Parent|
+---+-----+------+
| 1| A| null|
| 1| B| A|
| 1| C| A|
| 1| D| C|
| 1| E| B|
| 2| A| null|
| 2| B| null|
| 2| C| A|
| 2| D| A|
| 2| E| D|
+---+-----+------+
The expected result:
+---+-----+------+--------+
| ID|Child|Parent|Sequence|
+---+-----+------+--------+
| 1| A| null| 1|
| 1| B| A| 2|
| 1| C| A| 2|
| 1| D| C| 3|
| 1| E| B| 3|
| 2| A| null| 1|
| 2| B| null| 0|
| 2| C| A| 2|
| 2| D| A| 2|
| 2| E| D| 3|
+---+-----+------+--------+
What would be the best way to approach this?
I am aware that in SQL you can do this with a recursive CTE, but as far as I can tell there is no similar way to do it in PySpark.
Recursively joining Dataframes seems to be the way to accomplish this, however it does seem expensive and overcomplex.
Is there a more native/efficient way to accomplish this?
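To make the level-by-level logic concrete, here is a minimal plain-Python sketch (not Spark code) that reproduces the expected Sequence column for the sample data. The function name `sequence_levels` is hypothetical. The rule that a root without children gets 0 while a root with children gets 1 is an assumption read off the expected result above. Each pass of the `while` loop corresponds to one self-join in an iterative DataFrame version.

```python
from collections import defaultdict

data = [(1, "A", None), (1, "B", "A"), (1, "C", "A"), (1, "D", "C"), (1, "E", "B"),
        (2, "A", None), (2, "B", None), (2, "C", "A"), (2, "D", "A"), (2, "E", "D")]


def sequence_levels(rows):
    """Return {(id, child): level}, expanding each hierarchy one level at a time."""
    children = defaultdict(list)  # (id, parent) -> list of children
    for id_, child, parent in rows:
        if parent is not None:
            children[(id_, parent)].append(child)

    levels = {}
    frontier = []
    for id_, child, parent in rows:
        if parent is None:
            # Assumption taken from the expected output:
            # a root with children is level 1, an isolated root is level 0.
            has_children = (id_, child) in children
            levels[(id_, child)] = 1 if has_children else 0
            if has_children:
                frontier.append((id_, child))

    # Walk down one level per pass until no new nodes are reached.
    while frontier:
        next_frontier = []
        for id_, node in frontier:
            for child in children[(id_, node)]:
                if (id_, child) not in levels:  # guards against cycles
                    levels[(id_, child)] = levels[(id_, node)] + 1
                    next_frontier.append((id_, child))
        frontier = next_frontier
    return levels


levels = sequence_levels(data)
for id_, child, parent in data:
    print(id_, child, parent, levels[(id_, child)])
```

The number of passes equals the depth of the deepest hierarchy, which is also how many joins an iterative DataFrame approach would need.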
I am new to PySpark and I am trying to convert a couple of columns of double datatype to binary. I then want to count the number of non-zero digits in each binary number, i.e. get the sum of its binary digits.
My sample data looks as follows
bit_1 bit_2 bit_3 bit_4 bit_5 bit_6
0 2 8 0 0 0
11 0 16 64 0 0
10 0 0 0 256 144
12 15 15 0 0 0
20 0 17 0 0 0
250 12 0 0 0 0
300 72 84 64 0 0
320 100 120 140 220 240
So far I have tried the following:
test_df = df.withColumn('bit_sum', sum(map(int,"{0:b}".format(F.col('bit_1')))))
The code above throws an error.
I also tried the following:
df_2 = (df
.withColumn('bit_1_bi', F.lpad(F.bin(F.col('bit_1')),12,'0'))
.withColumn('bit_2_bi', F.lpad(F.bin(F.col('bit_2')),12,'0'))
.withColumn('bit_3_bi', F.lpad(F.bin(F.col('bit_3')),12,'0'))
.withColumn('bit_4_bi', F.lpad(F.bin(F.col('bit_4')),12,'0'))
.withColumn('bit_5_bi', F.lpad(F.bin(F.col('bit_5')),12,'0'))
.withColumn('bit_6_bi', F.lpad(F.bin(F.col('bit_6')),12,'0'))
)
Let us use bin to convert the column values into their binary string representation, then replace the 0's with an empty string and take the length of the resulting string, which is the number of 1's.
df.select(*[F.length(F.regexp_replace(F.bin(c), '0', '')).alias(c) for c in df.columns])
+-----+-----+-----+-----+-----+-----+
|bit_1|bit_2|bit_3|bit_4|bit_5|bit_6|
+-----+-----+-----+-----+-----+-----+
| 0| 1| 1| 0| 0| 0|
| 3| 0| 1| 1| 0| 0|
| 2| 0| 0| 0| 1| 2|
| 2| 4| 4| 0| 0| 0|
| 2| 0| 2| 0| 0| 0|
| 6| 2| 0| 0| 0| 0|
| 4| 2| 3| 1| 0| 0|
| 2| 3| 4| 3| 5| 4|
+-----+-----+-----+-----+-----+-----+
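As a quick cross-check of the table above, the same counts can be reproduced in plain Python (not Spark) from the sample data in the question. This is only a sanity-check sketch; the helper name `bit_count` is hypothetical.

```python
# Sample rows from the question: bit_1 .. bit_6
rows = [
    (0, 2, 8, 0, 0, 0),
    (11, 0, 16, 64, 0, 0),
    (10, 0, 0, 0, 256, 144),
    (12, 15, 15, 0, 0, 0),
    (20, 0, 17, 0, 0, 0),
    (250, 12, 0, 0, 0, 0),
    (300, 72, 84, 64, 0, 0),
    (320, 100, 120, 140, 220, 240),
]


def bit_count(n):
    """Count the 1 bits of a non-negative number via its binary string."""
    # int() mirrors the cast from double to an integer type.
    return bin(int(n)).count("1")


counts = [tuple(bit_count(v) for v in row) for row in rows]
for row in counts:
    print(row)
```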
I have a grouped box plot in which I want to change the outlier dots from the default of black to the colour of the boxes keeping everything else the same. There is a previous thread that provides a solution for this for a standard box plot that I am able to implement.
Coloring boxplot outlier points in ggplot2?
However, I want to do it for a grouped box plot.
Below is some example data and code for the grouped box plot.
|ID |Time |Metabolite | Concentration|
|:--|:----|:----------|-------------:|
|1 |1 |A | 40|
|1 |1 |B | 36|
|1 |1 |C | 28|
|1 |2 |A | 13|
|1 |2 |B | 150|
|1 |2 |C | 32|
|1 |3 |A | 45|
|1 |3 |B | 15|
|1 |3 |C | 15|
|2 |1 |A | 7|
|2 |1 |A | 9|
|2 |1 |B | 236|
|2 |1 |C | 33|
|2 |2 |A | 33|
|2 |2 |B | 48|
|2 |2 |C | 39|
|2 |3 |A | 15|
|2 |3 |C | 126|
|3 |1 |A | 13|
|3 |1 |B | 41|
|3 |1 |C | 37|
|3 |2 |A | 3|
|3 |2 |B | 218|
|3 |2 |C | 27|
|3 |3 |A | 7|
|3 |3 |B | 27|
|3 |3 |C | 3|
|4 |1 |A | 4|
|4 |1 |B | 7|
|4 |1 |C | 33|
|4 |2 |A | 133|
|4 |2 |B | 4|
|4 |2 |C | 10|
|4 |3 |A | 122|
|4 |3 |B | 27|
|4 |3 |C | 14|
|5 |1 |A | 7|
|5 |1 |B | 22|
|5 |1 |C | 43|
|5 |2 |A | 3|
|5 |2 |B | 6|
|5 |2 |C | 158|
|5 |3 |A | 48|
|5 |3 |B | 7|
|5 |3 |C | 24|
|6 |1 |A | 15|
|6 |1 |B | 30|
|6 |1 |C | 15|
|6 |2 |A | 27|
|6 |2 |B | 187|
|6 |2 |C | 9|
|6 |3 |A | 31|
|6 |3 |B | 40|
|6 |3 |C | 41|
|7 |1 |A | 37|
|7 |1 |B | 30|
|7 |1 |C | 28|
|7 |2 |A | 142|
|7 |2 |B | 40|
|7 |2 |C | 7|
|7 |3 |A | 45|
|7 |3 |B | 3|
|8 |3 |C | 45|
|8 |1 |A | 34|
|8 |1 |B | 8|
|8 |1 |C | 46|
|8 |2 |A | 167|
|8 |2 |B | 25|
|8 |2 |C | 34|
|8 |3 |A | 27|
|9 |3 |B | 28|
|9 |3 |C | 36|
|9 |1 |A | 44|
|9 |1 |B | 26|
|9 |1 |C | 20|
|9 |2 |A | 11|
|9 |2 |B | 18|
|9 |2 |C | 176|
|9 |3 |A | 1|
|9 |3 |B | 40|
|9 |3 |C | 10|
|10 |1 |A | 8|
|10 |1 |B | 49|
|10 |1 |C | 193|
|10 |2 |A | 13|
|10 |2 |B | 13|
|10 |2 |C | 28|
|10 |3 |A | 50|
|10 |3 |B | 47|
|10 |3 |C | 46|
|11 |1 |A | 21|
|11 |1 |B | 34|
|11 |1 |C | 28|
|11 |2 |A | 13|
|11 |2 |B | 32|
|11 |2 |C | 47|
|11 |3 |A | 15|
|11 |3 |B | 42|
|11 |3 |C | 9|
ggplot(df, aes(x=Time, y=Concentration, fill=Metabolite)) +
geom_boxplot()
I am reading from an API using R and was able to retrieve the data. However, the data is returned in JSON format, in what seems to be 3 sections.
I converted the JSON data to a df using
fromJSON(data)
The conversion was successful.
The next step was to drop the unnecessary columns, which is my current issue: I am not able to drop any columns. I only recently found out how the data is structured in the data frame.
lead <- GET("some api", add_headers(Authorization="Bearer some apikey"))
data <- content(lead,"text") #display the data looks something like below
The data seems to be split into 3 sections, which I believe is what prevents the columns from being dropped:
$'data'
$included
$links
I've looked at multiple Stack Overflow posts and R guides on dropping columns from a df, but was unsuccessful:
> df <- df[,-1]
Error in df[, -1] : incorrect number of dimensions
> df <- subset(df, select = -c("$'data'$id"))
Error in subset.default(df, select = -c("$'data'$id")) :
argument "subset" is missing, with no default
The end goal is to drop everything in $links, a few columns in $included, and a few columns in $data
Completely remove links:
df$links <- NULL
Remove columns 1 and 2 from included:
df$included[,c(1, 2)] <- NULL
Remove columns 3 and 4 from data:
df$data[,c(3, 4)] <- NULL
For a working example, see below with iris and mtcars:
exr <- list(iris = iris, mtcars = mtcars)
head(exr$mtcars)
| | mpg| cyl| disp| hp| drat| wt| qsec| vs| am| gear| carb|
|:-----------------|----:|---:|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|
|Mazda RX4 | 21.0| 6| 160| 110| 3.90| 2.620| 16.46| 0| 1| 4| 4|
|Mazda RX4 Wag | 21.0| 6| 160| 110| 3.90| 2.875| 17.02| 0| 1| 4| 4|
|Datsun 710 | 22.8| 4| 108| 93| 3.85| 2.320| 18.61| 1| 1| 4| 1|
|Hornet 4 Drive | 21.4| 6| 258| 110| 3.08| 3.215| 19.44| 1| 0| 3| 1|
|Hornet Sportabout | 18.7| 8| 360| 175| 3.15| 3.440| 17.02| 0| 0| 3| 2|
|Valiant | 18.1| 6| 225| 105| 2.76| 3.460| 20.22| 1| 0| 3| 1|
exr$mtcars[,c(1,2)] <- NULL
head(exr$mtcars)
| | disp| hp| drat| wt| qsec| vs| am| gear| carb|
|:-----------------|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|
|Mazda RX4 | 160| 110| 3.90| 2.620| 16.46| 0| 1| 4| 4|
|Mazda RX4 Wag | 160| 110| 3.90| 2.875| 17.02| 0| 1| 4| 4|
|Datsun 710 | 108| 93| 3.85| 2.320| 18.61| 1| 1| 4| 1|
|Hornet 4 Drive | 258| 110| 3.08| 3.215| 19.44| 1| 0| 3| 1|
|Hornet Sportabout | 360| 175| 3.15| 3.440| 17.02| 0| 0| 3| 2|
|Valiant | 225| 105| 2.76| 3.460| 20.22| 1| 0| 3| 1|
If I parse do.call(what=knitr::kable,args=args), the function kable inside do.call is parsed as a SYMBOL and not as a SYMBOL_FUNCTION_CALL.
Why shouldn't it be the latter?
tf <- tempfile()
cat('do.call(knitr::kable,args=args)',file = tf)
parsed <- utils::getParseData(parse(tf))
knitr::kable(parsed)
| | line1| col1| line2| col2| id| parent|token |terminal |text |
|:--|-----:|----:|-----:|----:|--:|------:|:--------------------|:--------|:-------|
|18 | 1| 1| 1| 31| 18| 0|expr |FALSE | |
|1 | 1| 1| 1| 7| 1| 3|SYMBOL_FUNCTION_CALL |TRUE |do.call |
|3 | 1| 1| 1| 7| 3| 18|expr |FALSE | |
|2 | 1| 8| 1| 8| 2| 18|'(' |TRUE |( |
|7 | 1| 9| 1| 20| 7| 18|expr |FALSE | |
|4 | 1| 9| 1| 13| 4| 7|SYMBOL_PACKAGE |TRUE |knitr |
|5 | 1| 14| 1| 15| 5| 7|NS_GET |TRUE |:: |
|6 | 1| 16| 1| 20| 6| 7|SYMBOL |TRUE |kable |
|8 | 1| 21| 1| 21| 8| 18|',' |TRUE |, |
|11 | 1| 22| 1| 25| 11| 18|SYMBOL_SUB |TRUE |args |
|12 | 1| 26| 1| 26| 12| 18|EQ_SUB |TRUE |= |
|13 | 1| 27| 1| 30| 13| 15|SYMBOL |TRUE |args |
|15 | 1| 27| 1| 30| 15| 18|expr |FALSE | |
|14 | 1| 31| 1| 31| 14| 18|')' |TRUE |) |
If you just have kable, it's a symbol. That symbol could point to a function or a value. It's not clear what it is until you actually evaluate it.
However, if you have kable(), it's clear that you expect kable to be a function and that you are calling it.
The do.call obscures the parser's ability to recognize that you are trying to call a function and that intention isn't realized till run-time.
Things can get funny if you do something like
sum <- 5
sum
# [1] 5
sum(1:3)
# [1] 6
Here sum is behaving both like a regular variable and like a function. We've actually created a shadow variable in our global environment that masks the sum function from base. But because the parser treats sum and sum() differently, we can still get at both meanings.