I have a file with several lines, for example:
A B C
awer.ttp.net Code 554
abcd.ttp.net Code 747
asdf.ttp.net Part 554
xyz.ttp.net Part 747
I want to write a command in Spark in R, using the sparklyr library, that splits just column A of the table and adds a new column D with the values awer, abcd, asdf, and xyz.
I have tried
data_2 %>% sdf_mutate(node2=ft_regex_tokenizer(data_2, input.col = "A", output.col = "D", pattern="[.]")) %>% sdf_register("mutated")
And then I try
mut_trial %>% mutate(E=D[[1]])
Error in eval(expr, envir, enclos) : object 'D' not found.
I'm not sure if I'm doing this the right way, but I wanted to see if there's another function to use, or a way to fix this one to do what I want.
The code below is Scala Spark; hope you get the idea and can convert it to sparklyr/SparkR.
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
("awer.ttp.net","Code", 554),
("abcd.ttp.net","Code", 747),
("asdf.ttp.net","Part", 554),
("xyz.ttp.net","Part", 747)
)).toDF("A","B","C")
data.withColumn("D", split($"A", "\\.")(0)).show(false)
Output:
+------------+----+---+----+
|A |B |C |D |
+------------+----+---+----+
|awer.ttp.net|Code|554|awer|
|abcd.ttp.net|Code|747|abcd|
|asdf.ttp.net|Part|554|asdf|
|xyz.ttp.net |Part|747|xyz |
+------------+----+---+----+
Hope this helped!
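For reference, a sparklyr equivalent of the Scala above is a plain dplyr mutate using Spark SQL's regexp_extract, which sparklyr translates to Spark for you (a sketch, assuming an existing Spark table data_2 with column A):

```r
library(sparklyr)
library(dplyr)

# regexp_extract is passed through to the Spark SQL function of the same
# name; capture group 1 takes everything before the first dot.
data_2 %>%
  mutate(D = regexp_extract(A, "^([^.]+)", 1)) %>%
  sdf_register("mutated")
```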
Related
I've got an RMarkdown script that works fine if I run the chunks manually, either one at a time or with "Run All". But when I try to use knitr to generate HTML or a PDF, I'm getting an error: Error in select(responses, starts_with("Q1 ") & !contains("None")) %>% : could not find function "%>%"
The actual full line reads:
cols <- select(responses, starts_with("Q1 ") & !contains("None") ) %>% colnames()
I'm working with data from a survey, where a lot of questions were "select as many as apply" type questions, and there was an open ended "None of the above" option. At this point, I'm pulling out exactly the columns I want (all the Q1 responses, but not Q10 or Q11 responses, and not the open ended response) so I can use pivot_longer() and summarize the responses. It works fine in the script: I get a list of the exact column names that I want, and then count the values.
But when I try to knit with knitr, it balks at the %>%.
processing file: 02_Survey2020_report.Rmd
|.... | 6%
ordinary text without R code
|......... | 12%
label: setup (with options)
List of 1
$ include: logi FALSE
|............. | 19%
ordinary text without R code
|.................. | 25%
label: demographics gender
Quitting from lines 28-46 (02_Survey2020_report.Rmd)
Error in select(responses, starts_with("Q1 ") & !contains("None")) %>% :
could not find function "%>%"
Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
Execution halted
A simplified reproducible example gets the same results. I run the following and get what I expect, a tidy table with the count times each answer was selected:
example <- data.frame("id" = c(009,008,007,006,005,004,003,002,001,010), "Q3_Red" = c("","","","Red","","","","Red","Red","Red"), "Q3_Blue" = c("","","","","","Blue","Blue","Blue","",""),
"Q3_Green" = c("","Green","Green","","","","","Green","",""), "Q3_Purple" = c("","Purple","","","Purple","","Purple","","Purple","Purple"),
"Q3_None of the above" = c(009,008,"Verbose explanation that I don't want to count." ,006,005,004,003,002,"Another verbose entry.",010)
)
cols <- select(example, starts_with("Q3") & !contains("None") ) %>% colnames()
example %>%
pivot_longer(cols = all_of(cols),
values_to = "response") %>%
filter(response != "") %>%
count(response)
But when I use Ctrl+Shift+K to output a document, I get the same error:
processing file: 00a_reproducible_examples.Rmd
Quitting from lines 9-25 (00a_reproducible_examples.Rmd)
Error in select(example, starts_with("Q3") & !contains("None")) %>% colnames() :
could not find function "%>%"
Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
Execution halted
Why is knitr balking at a pipe?
I recently ran into a similar problem.
Not sure if there is a more sophisticated solution, but loading the library in each chunk of code worked for me.
To display the result of a chunk without the messages about library loading, add message = FALSE.
Example:
```{r, echo=FALSE, message = FALSE}
library(dplyr)
>>your code with dplyr<<
```
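Alternatively, you can load every package once in a setup chunk at the top of the .Rmd: knitr runs the document in a fresh R session, so packages attached in your interactive console are not available unless the document itself loads them. A minimal sketch, assuming dplyr and tidyr are the packages in use:

```{r setup, include=FALSE}
library(dplyr)
library(tidyr)
```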
I'm trying to convert a Pyspark dataframe into a dictionary.
Here's the sample CSV file -
Col0, Col1
-----------
A153534,BDBM40705
R440060,BDBM31728
P440245,BDBM50445050
I've come up with this code -
from rdkit import Chem
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
df = spark.read.csv("gs://my-bucket/my_file.csv") # has two columns
# Creating list
to_list = map(lambda row: row.asDict(), df.collect())
#Creating dictionary
to_dict = {x['col0']: x for x in to_list }
This creates a dictionary like below -
{'A153534': {'col0': 'A153534', 'col1': 'BDBM40705'}, 'R440060': {'col0': 'R440060', 'col1': 'BDBM31728'}, 'P440245': {'col0': 'P440245', 'col1': 'BDBM50445050'}}
But I want a dictionary like this -
{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}
How can I do that?
I tried the rdd solution by Yolo, but I'm getting an error. Can you please tell me what I am doing wrong?
py4j.protocol.Py4JError: An error occurred while calling
o80.isBarrier. Trace: py4j.Py4JException: Method isBarrier([]) does
not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Here's a way of doing it using rdd:
df.rdd.map(lambda x: {x.Col0: x.Col1}).collect()
[{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}]
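If the rows have already been collected to the driver, the same reshaping works in plain Python too (a sketch; plain dicts stand in here for the Row objects that df.collect() would return):

```python
# Each collected Row behaves like a mapping from column name to value;
# plain dicts stand in for Spark Rows so the reshaping itself is visible.
rows = [
    {"Col0": "A153534", "Col1": "BDBM40705"},
    {"Col0": "R440060", "Col1": "BDBM31728"},
    {"Col0": "P440245", "Col1": "BDBM50445050"},
]

# One single-pair dict per row, matching the rdd.map solution above.
dicts = [{r["Col0"]: r["Col1"]} for r in rows]
# [{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}]

# Or, if one flat mapping is preferred over a list of dicts:
flat = {r["Col0"]: r["Col1"] for r in rows}
```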
This could help you:
from pyspark.sql.functions import create_map, to_json

df = spark.read.csv('/FileStore/tables/Create_dict.txt', header=True)
df = df.withColumn('dict', to_json(create_map(df.Col0, df.Col1)))
df_list = [row['dict'] for row in df.select('dict').collect()]
df_list
Output is:
['{"A153534":"BDBM40705"}',
'{"R440060":"BDBM31728"}',
'{"P440245":"BDBM50445050"}']
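Note that to_json produces JSON strings, not dicts. If actual Python dicts are wanted on the driver, the strings can be parsed back with the standard json module (a sketch reusing the output above):

```python
import json

df_list = ['{"A153534":"BDBM40705"}',
           '{"R440060":"BDBM31728"}',
           '{"P440245":"BDBM50445050"}']

# json.loads turns each JSON string into a one-entry dict.
parsed = [json.loads(s) for s in df_list]
# [{'A153534': 'BDBM40705'}, {'R440060': 'BDBM31728'}, {'P440245': 'BDBM50445050'}]
```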
Say that I have a dataframe
xy.df <- data.frame(x = runif(10), y = runif(10))
What I want to do is:
Create a list of non-redundant items in column 1
For each item in this list (items in column 1), identify the list of corresponding items in column 2
I have tried some tests with dplyr but I still don't get it!
df = xy.df %>% group_by(xy.df$x)
Any help would be appreciated.
Try this:
Your data.frame:
db<-data.frame(idProcess=c("5aa78","5aa78","9a978"),
ip=c("128.55.12.81","128.55.12.81","130.50.12.99"),
port=c(9265,59264,63925))
Building your output (not the most efficient way, but it is clear what is being done):
result <- list()
id_unique <- as.character(unique(db$idProcess))
for (i in seq_along(id_unique)) {
  rows_i <- db[as.character(db$idProcess) == id_unique[i], ]
  result[[id_unique[i]]] <- c(unique(as.character(rows_i$ip)),
                              unique(as.character(rows_i$port)))
}
Your output:
result
$`5aa78`
[1] "128.55.12.81" "9265"         "59264"
$`9a978`
[1] "130.50.12.99" "63925"
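A more compact alternative uses base R's split (a sketch, assuming the same db as above):

```r
# Split the rows by process id, then keep the unique IPs and ports per group.
result <- lapply(split(db, db$idProcess), function(d)
  c(unique(as.character(d$ip)), unique(as.character(d$port))))
```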
Sorry, I wanted to simplify my problem with the preceding examples, so here is a small example of the dataframe:
idProcess | ip | port|
5aa78 | 128.55.12.81 | 9265
5aa78 | 128.55.12.81 | 59264
9a978 | 130.50.12.99 | 63925
.....
So what I want to have is a list of lists, where each entry in the global list is a process name, and for each process I get the non-redundant IPs and non-redundant ports in one list, i.e.
List["5aa78"]=(128.55.12.81, 9265 , 59264)
List["9a978"]=( 130.50.12.99 , 63925)
....
thanks
I have the below dataframe:
df_Place:
Name|Places
----+-----------------------
abc |delhi
bcd |mumbai,delhi
cde |chennai,hyderabad,delhi
def |mumbai
efg |bangalore,mumbai
ghi |delhi,bangalore
I wanted to have the places in the form of a matrix, so I did the below operation:
df_Place$matrix <- as.matrix(strsplit(df_Place$Places, ","))
I get the below dataframe:
Name|Places |matrix
----+-----------------------+------------------------------
abc |delhi |delhi
bcd |mumbai,delhi |c("mumbai","delhi")
cde |chennai,hyderabad,delhi|c("chennai","hyderabad","delhi")
def |mumbai |mumbai
efg |bangalore,mumbai |c("bangalore","mumbai")
ghi |delhi,bangalore |c("delhi","bangalore")
Now, while trying to write this to a CSV with
write.csv(df_Place, "tx.csv")
I get the below error:
Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type 'list' in 'EncodeElement'
But if I remove the matrix column, then it gets written successfully.
I know this is probably very basic, but can someone explain the reason behind this?
It has to do with the matrix column actually being a list column: strsplit returns a list, so each cell holds a character vector, and write.csv cannot encode a list inside a single CSV cell (hence the unimplemented type 'list' error). I found this solution to work (see Outputting a Dataframe in R to a .csv):
# First coerce the data.frame to all-character
df_Place2 <- data.frame(lapply(df_Place, as.character), stringsAsFactors = FALSE)
# write file
write.csv(df_Place2, "tx.csv")
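Alternatively, you can collapse the list column into a single delimited string before writing (a sketch, assuming the Places column from the question; the semicolon separator is an arbitrary choice):

```r
# Each cell becomes e.g. "chennai;hyderabad;delhi" instead of a character vector.
df_Place$matrix <- sapply(strsplit(df_Place$Places, ","), paste, collapse = ";")
write.csv(df_Place, "tx.csv", row.names = FALSE)
```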
You can use the data.table library:
library(data.table)
fwrite(df_Place, file = "df_Place.csv")
I would like to add symbols and letters before and after some numbers when using knitr's kable function, but do not know how to do this efficiently. I am, however, also willing to consider pandoc/pander if it is better/more efficient.
The end result should be an HTML table, or a very good graphic of one.
Please see the following code as a mock reproducible example that is in a .Rmd file:
### Notional and Cumulative P&L
```{r echo=FALSE}
Notional <- 10000
yday_pnl <- -2942
wtd_pnl <- 2300
mtd_pnl <- -3334
ytd_pnl <- 5024
yday_rtn <- (yday_pnl/Notional)*10000
wtd_rtn <- (wtd_pnl/Notional)*10000
mtd_rtn <- (mtd_pnl/Notional)*10000
ytd_rtn <- (ytd_pnl/Notional)*10000
Value <- c(Notional,yday_pnl,wtd_pnl,mtd_pnl,ytd_pnl)
rtn <- c(NA,yday_rtn,wtd_rtn,mtd_rtn,ytd_rtn)
COB.basics <- as.data.frame(cbind(Value,rtn))
rownames(COB.basics) <- c('Notional','yday pnl','wtd_pnl','mtd_pnl','ytd_pnl')
```
```{r results='asis',echo=FALSE}
kable(COB.basics,digits=2)
```
So, similar to Excel's currency or accounting formats, I would like the Value column to have a $ sign, and the rtn column to have the string bps after the numbers. Also, for readability purposes, is it possible to insert commas every three digits before the decimal point, i.e. to represent thousands etc.?
Also, is it possible to colour the cells, and to colour the text/numbers too, e.g. red for negative values?
Partial solution with pander:
Set "big mark" for pander so that it would be used for all numbers:
panderOptions('big.mark', ',')
You can also set the table style to rmarkdown (optional, as R Markdown v2 now also uses Pandoc, and the default multiline format has some cool features compared to what the rmarkdown style offered before):
panderOptions('table.style', 'rmarkdown')
You can highlight some cells with e.g. which and some custom R expression:
emphasize.strong.cells(which(COB.basics > 0, arr.ind = TRUE))
Simply call pander on your data.frame:
> library(pander)
> emphasize.strong.cells(which(COB.basics > 0, arr.ind = TRUE))
> panderOptions('big.mark', ',')
> pander(COB.basics)
-----------------------------------
Value rtn
-------------- ---------- ---------
**Notional** **10,000** NA
**yday pnl** -2,942 -2,942
**wtd_pnl** **2,300** **2,300**
**mtd_pnl** -3,334 -3,334
**ytd_pnl** **5,024** **5,024**
-----------------------------------
> panderOptions('table.style', 'rmarkdown')
> pander(COB.basics)
| | Value | rtn |
|:--------------:|:-------:|:------:|
| **Notional** | 10,000 | NA |
| **yday pnl** | -2,942 | -2,942 |
| **wtd_pnl** | 2,300 | 2,300 |
| **mtd_pnl** | -3,334 | -3,334 |
| **ytd_pnl** | 5,024 | 5,024 |
To colour the cells, you could add some custom HTML/CSS markup manually (or LaTeX if working with PDF in the long run). The same goes for adding % or other symbols/strings to your cells, e.g. with paste and apply -- but please feel free to submit a feature request at https://github.com/Rapporter/pander
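For the kable route, one option is to pre-format the columns as strings before calling kable (a minimal sketch, assuming the COB.basics data frame from the question; formatC supplies the thousands separator):

```r
library(knitr)

fmt <- COB.basics
fmt$Value <- paste0("$", formatC(fmt$Value, format = "d", big.mark = ","))
fmt$rtn   <- ifelse(is.na(fmt$rtn), "",
                    paste0(formatC(fmt$rtn, format = "d", big.mark = ","), " bps"))
kable(fmt)
```

Note that once the numbers are formatted as strings, kable's digits argument no longer applies.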