I am reading an API using R and was able to retrieve the data; however, the data comes back as JSON in what appear to be 3 sections.
Converting the JSON data to a data frame with
fromJSON(data)
was successful.
The next step was to drop the unnecessary columns, which is my current issue: I am not able to drop any columns, and I only just found out how the data is structured in the resulting object.
lead <- GET("some api", add_headers(Authorization="Bearer some apikey"))
data <- content(lead, "text") # displaying the data looks something like below
The data seems to be split into 3 sections, which I believe is what prevents the columns from being dropped:
$'data'
$included
$links
I've looked at multiple Stack Overflow posts and R guides on dropping columns from a data frame, but was unsuccessful:
> df <- df[,-1]
Error in df[, -1] : incorrect number of dimensions
> df <- subset(df, select = -c("$'data'$id"))
Error in subset.default(df, select = -c("$'data'$id")) :
argument "subset" is missing, with no default
The end goal is to drop everything in $links, a few columns in $included, and a few columns in $data
Your object is a list of three elements, not a single data frame, which is why df[,-1] fails with "incorrect number of dimensions". Index into each element instead.
Completely remove links:
df$links <- NULL
Remove columns 1 and 2 from included:
df$included[,c(1, 2)] <- NULL
Remove columns 3 and 4 from data:
df$data[,c(3, 4)] <- NULL
For a working example, see below with iris and mtcars:
exr <- list(iris = iris, mtcars = mtcars)
head(exr$mtcars)
| | mpg| cyl| disp| hp| drat| wt| qsec| vs| am| gear| carb|
|:-----------------|----:|---:|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|
|Mazda RX4 | 21.0| 6| 160| 110| 3.90| 2.620| 16.46| 0| 1| 4| 4|
|Mazda RX4 Wag | 21.0| 6| 160| 110| 3.90| 2.875| 17.02| 0| 1| 4| 4|
|Datsun 710 | 22.8| 4| 108| 93| 3.85| 2.320| 18.61| 1| 1| 4| 1|
|Hornet 4 Drive | 21.4| 6| 258| 110| 3.08| 3.215| 19.44| 1| 0| 3| 1|
|Hornet Sportabout | 18.7| 8| 360| 175| 3.15| 3.440| 17.02| 0| 0| 3| 2|
|Valiant | 18.1| 6| 225| 105| 2.76| 3.460| 20.22| 1| 0| 3| 1|
exr$mtcars[,c(1,2)] <- NULL
head(exr$mtcars)
| | disp| hp| drat| wt| qsec| vs| am| gear| carb|
|:-----------------|----:|---:|----:|-----:|-----:|--:|--:|----:|----:|
|Mazda RX4 | 160| 110| 3.90| 2.620| 16.46| 0| 1| 4| 4|
|Mazda RX4 Wag | 160| 110| 3.90| 2.875| 17.02| 0| 1| 4| 4|
|Datsun 710 | 108| 93| 3.85| 2.320| 18.61| 1| 1| 4| 1|
|Hornet 4 Drive | 258| 110| 3.08| 3.215| 19.44| 1| 0| 3| 1|
|Hornet Sportabout | 360| 175| 3.15| 3.440| 17.02| 0| 0| 3| 2|
|Valiant | 225| 105| 2.76| 3.460| 20.22| 1| 0| 3| 1|
I am currently trying to create a column that reflects a sequence from a recursive hierarchy in PySpark. This is what the data looks like:
data = [(1,"A",None),(1,"B","A"),(1,"C","A"),(1,"D","C"),(1,"E","B"),(2,"A",None),(2,"B",None),(2,"C","A"),(2,"D","A"),(2,"E","D")]
df = spark.createDataFrame(data, "ID integer, Child string, Parent string")
+---+-----+------+
| ID|Child|Parent|
+---+-----+------+
| 1| A| null|
| 1| B| A|
| 1| C| A|
| 1| D| C|
| 1| E| B|
| 2| A| null|
| 2| B| null|
| 2| C| A|
| 2| D| A|
| 2| E| D|
+---+-----+------+
The expected result:
+---+-----+------+--------+
| ID|Child|Parent|Sequence|
+---+-----+------+--------+
| 1| A| null| 1|
| 1| B| A| 2|
| 1| C| A| 2|
| 1| D| C| 3|
| 1| E| B| 3|
| 2| A| null| 1|
| 2| B| null| 0|
| 2| C| A| 2|
| 2| D| A| 2|
| 2| E| D| 3|
+---+-----+------+--------+
What would be the best way to approach this?
I am aware that in SQL you can do this with a recursive CTE, but from my investigation there is no similar construct in PySpark.
Recursively joining DataFrames seems to be one way to accomplish this, but it looks expensive and overly complex.
Is there a more native/efficient way to accomplish this?
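Outside of Spark, the sequence described here is each node's depth in the parent-child tree (roots at 1). Below is a driver-side sketch in plain Python, assuming the hierarchy is small enough to collect locally; note that it assigns 1 to every root, whereas the expected output shows 0 for (2, B, null), so any special rule for childless roots would need to be added on top.

```python
from collections import defaultdict

def sequence_levels(rows):
    """Compute each child's depth per ID: roots get 1, children get parent's level + 1.
    rows: iterable of (id, child, parent) tuples, with parent None for roots."""
    levels = {}
    # group rows by ID so hierarchies that reuse node names don't collide
    by_id = defaultdict(list)
    for id_, child, parent in rows:
        by_id[id_].append((child, parent))
    for id_, pairs in by_id.items():
        depth = {child: 1 for child, parent in pairs if parent is None}  # roots
        pending = [(c, p) for c, p in pairs if p is not None]
        # repeatedly resolve children whose parent's depth is already known
        while pending:
            still_unresolved = []
            for child, parent in pending:
                if parent in depth:
                    depth[child] = depth[parent] + 1
                else:
                    still_unresolved.append((child, parent))
            if len(still_unresolved) == len(pending):  # cycle or missing parent
                break
            pending = still_unresolved
        for child, d in depth.items():
            levels[(id_, child)] = d
    return levels

data = [(1,"A",None),(1,"B","A"),(1,"C","A"),(1,"D","C"),(1,"E","B"),
        (2,"A",None),(2,"B",None),(2,"C","A"),(2,"D","A"),(2,"E","D")]
levels = sequence_levels(data)
```

The same fixed-point loop is what an iterative self-join in Spark would compute, one join per hierarchy level.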
I am new to PySpark. I am trying to convert a couple of columns with double datatype to binary, and I want to count the number of non-zero digits in the binary number (i.e., get the sum of its binary digits).
My sample data looks as follows
bit_1 bit_2 bit_3 bit_4 bit_5 bit_6
0 2 8 0 0 0
11 0 16 64 0 0
10 0 0 0 256 144
12 15 15 0 0 0
20 0 17 0 0 0
250 12 0 0 0 0
300 72 84 64 0 0
320 100 120 140 220 240
So far I have tried the code below:
test_df = df.withColumn('bit_sum', sum(map(int,"{0:b}".format(F.col('bit_1')))))
The code above throws an error.
I also tried the following:
df_2 = (df
.withColumn('bit_1_bi', F.lpad(F.bin(F.col('bit_1')),12,'0'))
.withColumn('bit_2_bi', F.lpad(F.bin(F.col('bit_2')),12,'0'))
.withColumn('bit_3_bi', F.lpad(F.bin(F.col('bit_3')),12,'0'))
.withColumn('bit_4_bi', F.lpad(F.bin(F.col('bit_4')),12,'0'))
.withColumn('bit_5_bi', F.lpad(F.bin(F.col('bit_5')),12,'0'))
.withColumn('bit_6_bi', F.lpad(F.bin(F.col('bit_6')),12,'0'))
)
Let us use bin to convert the column values into a binary string representation, then replace the 0's with an empty string and take the length of the resulting string to count the number of 1's:
df.select(*[F.length(F.regexp_replace(F.bin(c), '0', '')).alias(c) for c in df.columns])
+-----+-----+-----+-----+-----+-----+
|bit_1|bit_2|bit_3|bit_4|bit_5|bit_6|
+-----+-----+-----+-----+-----+-----+
| 0| 1| 1| 0| 0| 0|
| 3| 0| 1| 1| 0| 0|
| 2| 0| 0| 0| 1| 2|
| 2| 4| 4| 0| 0| 0|
| 2| 0| 2| 0| 0| 0|
| 6| 2| 0| 0| 0| 0|
| 4| 2| 3| 1| 0| 0|
| 2| 3| 4| 3| 5| 4|
+-----+-----+-----+-----+-----+-----+
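The trick above (binary string, drop the zeros, take the length) is just a population count; it can be verified in plain Python before running it on Spark:

```python
def popcount(n: int) -> int:
    # mirror of F.length(F.regexp_replace(F.bin(c), '0', '')):
    # format as binary, drop the zeros, count what is left
    return len(bin(n)[2:].replace('0', ''))

row = [0, 2, 8, 0, 0, 0]           # first sample row from the question
print([popcount(v) for v in row])  # -> [0, 1, 1, 0, 0, 0]
```

The results match the first row of the answer's output table.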
If I parse do.call(what=knitr::kable, args=args), the function kable inside do.call is parsed as a SYMBOL and not as a SYMBOL_FUNCTION_CALL.
Why shouldn't it be the latter?
tf <- tempfile()
cat('do.call(knitr::kable,args=args)',file = tf)
parsed <- utils::getParseData(parse(tf))
knitr::kable(parsed)
| | line1| col1| line2| col2| id| parent|token |terminal |text |
|:--|-----:|----:|-----:|----:|--:|------:|:--------------------|:--------|:-------|
|18 | 1| 1| 1| 31| 18| 0|expr |FALSE | |
|1 | 1| 1| 1| 7| 1| 3|SYMBOL_FUNCTION_CALL |TRUE |do.call |
|3 | 1| 1| 1| 7| 3| 18|expr |FALSE | |
|2 | 1| 8| 1| 8| 2| 18|'(' |TRUE |( |
|7 | 1| 9| 1| 20| 7| 18|expr |FALSE | |
|4 | 1| 9| 1| 13| 4| 7|SYMBOL_PACKAGE |TRUE |knitr |
|5 | 1| 14| 1| 15| 5| 7|NS_GET |TRUE |:: |
|6 | 1| 16| 1| 20| 6| 7|SYMBOL |TRUE |kable |
|8 | 1| 21| 1| 21| 8| 18|',' |TRUE |, |
|11 | 1| 22| 1| 25| 11| 18|SYMBOL_SUB |TRUE |args |
|12 | 1| 26| 1| 26| 12| 18|EQ_SUB |TRUE |= |
|13 | 1| 27| 1| 30| 13| 15|SYMBOL |TRUE |args |
|15 | 1| 27| 1| 30| 15| 18|expr |FALSE | |
|14 | 1| 31| 1| 31| 14| 18|')' |TRUE |) |
If you just have kable, it's a symbol. That symbol could point to a function or a value; it's not clear which until you actually evaluate it.
However, if you have kable(), it's clear that you expect kable to be a function and that you are calling it.
The do.call obscures the parser's ability to recognize that you are trying to call a function; that intention isn't realized until run-time.
Things can get funny if you do something like
sum <- 5
sum
# [1] 5
sum(1:3)
# [1] 6
Here sum is behaving both like a regular variable and like a function. We've actually created a shadow variable in our global environment that masks the sum function from base. But because the parser treats sum and sum() differently, we can still get at both meanings.
I would like to report descriptive values in a table (I am sure they should be in a table and not in a figure). The data comes from a 3-factorial experiment, so the table that I am able to produce with xtable (I'm working in R Markdown with knitr and have never used LaTeX) contains one line per data value, in the format:
group | condition | type | value
When all the lines are printed below each other, this is not very readable; for example, the "group" entry remains the same for 10 lines. Is there a way to print it only the first time (in the first line) and then omit it until "group" changes to the next group (printing it again only in line 11)?
My table should have apa-format, so I use either rapa::apa(mytable) or papaja::apa_table(mytable) for the final print.
Any help would be appreciated, thanks!
There are a few different ways to do this.
library(data.table)
dt = data.table("Group" = c(rep("A",4),rep("B",4)), "value" = rep(1:4, each = 2))
knitr::kable(dt)
> dt
Group value
1: A 1
2: A 1
3: A 2
4: A 2
5: B 3
6: B 3
7: B 4
8: B 4
We can remove duplicates across all rows
knitr::kable(dt[!duplicated(dt),])
|Group | value|
|:-----|-----:|
|A | 1|
|A | 2|
|B | 3|
|B | 4|
Or, we can remove duplicates according to specific columns:
knitr::kable(unique(dt,by = c("Group")))
|Group | value|
|:-----|-----:|
|A | 1|
|B | 3|
Then, since a group can match multiple rows, we can specify which one we want to grab:
knitr::kable(dt[unique(dt,by = c("Group")),.(Group, value), mult = "first"])
|Group | value|
|:-----|-----:|
|A | 1|
|B | 3|
knitr::kable(dt[unique(dt,by = c("Group")),.(Group, value), mult = "last"])
|Group | value|
|:-----|-----:|
|A | 2|
|B | 4|
EDIT
To avoid printing values in a group column that have already appeared:
dt$Group = ifelse(duplicated(dt$Group),"",dt$Group)
knitr::kable(dt)
|Group | value|
|:-----|-----:|
|A | 1|
| | 1|
| | 2|
| | 2|
|B | 3|
| | 3|
| | 4|
| | 4|
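The blanking trick is language-agnostic. As a quick pure-Python illustration of the same idea (mirroring R's `ifelse(duplicated(x), "", x)`: keep the first occurrence of each value, replace every later repeat with an empty string):

```python
def blank_duplicates(values):
    # keep the first occurrence of each value; blank out any later repeat,
    # matching the semantics of R's duplicated() (seen-before, not just consecutive)
    seen = set()
    out = []
    for v in values:
        out.append("" if v in seen else v)
        seen.add(v)
    return out

groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(blank_duplicates(groups))  # -> ['A', '', '', '', 'B', '', '', '']
```

Note that, like duplicated(), this blanks any value seen earlier, so it is only equivalent to "print at transitions" when the data is sorted by group.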
You can use the duplicated function with negation (!) to retain values of "group" only at transitions, but be careful that it does not result in loss of information from the other columns (if they are important). In the demo dataset we retain only transitions of the cyl variable.
mtcarsSubset = mtcars[,1:5]
knitr::kable(mtcarsSubset)
#| | mpg| cyl| disp| hp| drat|
#|:-------------------|----:|---:|-----:|---:|----:|
#|Mazda RX4 | 21.0| 6| 160.0| 110| 3.90|
#|Mazda RX4 Wag | 21.0| 6| 160.0| 110| 3.90|
#|Datsun 710 | 22.8| 4| 108.0| 93| 3.85|
#|Hornet 4 Drive | 21.4| 6| 258.0| 110| 3.08|
#|Hornet Sportabout | 18.7| 8| 360.0| 175| 3.15|
#|Valiant | 18.1| 6| 225.0| 105| 2.76|
#|Duster 360 | 14.3| 8| 360.0| 245| 3.21|
#|Merc 240D | 24.4| 4| 146.7| 62| 3.69|
#|Merc 230 | 22.8| 4| 140.8| 95| 3.92|
#|Merc 280 | 19.2| 6| 167.6| 123| 3.92|
#|Merc 280C | 17.8| 6| 167.6| 123| 3.92|
#|Merc 450SE | 16.4| 8| 275.8| 180| 3.07|
#|Merc 450SL | 17.3| 8| 275.8| 180| 3.07|
#|Merc 450SLC | 15.2| 8| 275.8| 180| 3.07|
#|Cadillac Fleetwood | 10.4| 8| 472.0| 205| 2.93|
#|Lincoln Continental | 10.4| 8| 460.0| 215| 3.00|
#|Chrysler Imperial | 14.7| 8| 440.0| 230| 3.23|
#|Fiat 128 | 32.4| 4| 78.7| 66| 4.08|
#|Honda Civic | 30.4| 4| 75.7| 52| 4.93|
#|Toyota Corolla | 33.9| 4| 71.1| 65| 4.22|
#|Toyota Corona | 21.5| 4| 120.1| 97| 3.70|
#|Dodge Challenger | 15.5| 8| 318.0| 150| 2.76|
#|AMC Javelin | 15.2| 8| 304.0| 150| 3.15|
#|Camaro Z28 | 13.3| 8| 350.0| 245| 3.73|
#|Pontiac Firebird | 19.2| 8| 400.0| 175| 3.08|
#|Fiat X1-9 | 27.3| 4| 79.0| 66| 4.08|
#|Porsche 914-2 | 26.0| 4| 120.3| 91| 4.43|
#|Lotus Europa | 30.4| 4| 95.1| 113| 3.77|
#|Ford Pantera L | 15.8| 8| 351.0| 264| 4.22|
#|Ferrari Dino | 19.7| 6| 145.0| 175| 3.62|
#|Maserati Bora | 15.0| 8| 301.0| 335| 3.54|
#|Volvo 142E | 21.4| 4| 121.0| 109| 4.11|
knitr::kable(mtcarsSubset[!duplicated(mtcarsSubset$cyl),])
#| | mpg| cyl| disp| hp| drat|
#|:-----------------|----:|---:|----:|---:|----:|
#|Mazda RX4 | 21.0| 6| 160| 110| 3.90|
#|Datsun 710 | 22.8| 4| 108| 93| 3.85|
#|Hornet Sportabout | 18.7| 8| 360| 175| 3.15|
Finally, I changed the data frame that is then converted into a table:
ReplicationTable %>% mutate(dependent_variable = ifelse(duplicated(dependent_variable), "", dependent_variable))
This replaces every entry after the first unique entry in dependent_variable with an empty string. It also works on grouped data frames.
I have a file with the columns below and want to import it into a SQL table, at the same time removing the hyphens under the column headers and the trailing delimiter. Please help.
timestamp |cp |type |count |fail_count |succ |fail |
-------------------|---|--------|-----------|-----------|-----------|-----------|
2014.12.15 00:00:00| 1| 5| 5| 0| 143| 0|
2014.12.15 01:00:00| 1| 5| 30| 0| 945| 0|
2014.12.15 02:00:00| 1| 5| 30| 0| 1055| 0|
2014.12.15 03:00:00| 1| 5| 24| 0| 816| 0|
2014.12.15 04:00:00| 1| 5| 28| 0| 882| 0|
2014.12.15 05:00:00| 1| 5| 6| 0| 155| 0|
2014.12.15 06:00:00| 1| 5| 12| 0| 236| 0|
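One possible pre-processing step, sketched in Python (assuming the dump is a pipe-delimited text file like the one shown): drop the hyphen ruler line and strip the trailing `|` from every row before handing the file to your database's bulk loader.

```python
def clean_dump(lines):
    """Remove the hyphen separator row and each line's trailing '|' delimiter."""
    cleaned = []
    for line in lines:
        line = line.rstrip()
        # skip the ruler row made only of hyphens, pipes, and spaces
        if line and set(line) <= set("-| "):
            continue
        cleaned.append(line.rstrip("|").rstrip())
    return cleaned

raw = [
    "timestamp          |cp |type    |count      |",
    "-------------------|---|--------|-----------|",
    "2014.12.15 00:00:00|  1|       5|          5|",
]
print(clean_dump(raw))
```

The function and file layout here are assumptions based on the sample shown; adjust the delimiter handling to match the actual export format and your loader's expectations.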