I am trying to use invoke in RStudio's Sparklyr to do a simple word count off of a text file in HDFS and have not figured out the syntax. I can get the whole file back as a list by using (similar to the count example in the SparklyR doc on extensions - http://spark.rstudio.com/extensions.html):
getFileCollect <- function(sc, path) {
spark_context(sc) %>%
invoke("textFile", path, 1L) %>%
invoke("collect")
}
fc <- getFileCollect(sc, "hdfs:///tmp/largeTomes/bigEx.txt")
What I want to do is a flatmap on that text file to do the classic scala code:
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
But haven't even come close on figuring out the syntax of invoke. flatMap is a method of textFile. Surely somebody has done this before and I'm just not thinking right.
Thank you!
Maybe you figured it out (it has been a few months), but for using collect with dplyr, you just have to put it that way:
myFileCollected <- myFileDF %>% collect
Then i would try to use dplyr functions, ie mutate (thats one of sparklyr advantages).
Related
I am using Sparklyr for a project and have understood that persisting is very useful. I am using sdf_persist for this, with the following syntax (correct me if I am wrong):
data_frame <- sdf_persist(data_frame)
Now I am reaching a point where I have too many RDDs stored in memory, so I need to unpersist some. However I cannot seem to find the function to do this in Sparklyr. Note that I have tried:
dplyr::db_drop_table(sc, "data_frame")
dplyr::db_drop_table(sc, data_frame)
unpersist(data_frame)
sdf_unpersist(data_frame)
But none of those work.
Also, I am trying to avoid using tbl_cache (in which case it seems that db_drop_table works) as it seems that sdf_persist offers more liberty on the storage level. It might be that I am missing the big picture of how to use persistence here, in which case, I'll be happy to learn more.
If you don't care about granularity then the simplest solution is to invoke Catalog.clearCache:
spark_session(sc) %>% invoke("catalog") %>% invoke("clearCache")
Uncaching specific object is much less straightforward due to sparklyr indirection. If you check the object returned by sdf_cache you'll see that the persisted table is not exposed directly:
df <- copy_to(sc, iris, memory=FALSE, overwrite=TRUE) %>% sdf_persist()
spark_dataframe(df) %>%
invoke("storageLevel") %>%
invoke("equals", invoke_static(sc, "org.apache.spark.storage.StorageLevel", "NONE"))
[1] TRUE
That's beacuase you don't get registered table directly, but rather a result of subquery like SELECT * FROM ....
It means you cannot simply call unpersist:
spark_dataframe(df) %>% invoke("unpersist")
as you would in one of the official API's.
Instead you can try to retrieve the name of the source table, for example like this
src_name <- as.character(df$ops$x)
and then invoke Catalog.uncacheTable:
spark_session(sc) %>% invoke("catalog") %>% invoke("uncacheTable", src_name)
That is likely not the most robust solution, so please use with caution.
I am looking to create a custom function in R that will allow the user to call the function and then it will produce an auto complete pipeline ready for them to edit, this way they can jump into quickly customizing the standard pipeline instead of copy pasting from an old script or retyping. How can I go about setting up this sort of autocomplete:
#pseudo code what I type---
seq.Date(1,2,by = "days") %>%
pblapply(function(x){
read.fst(list.files(as.character(x), as.data.table = T) %>%
group_by(x) %>%
count()
}) %>% rbindlist()
#how can I write a function so that when I call that function, it ouputs an autocomplete
#of the above so that I can go ahead and start just customizing the code? Something like this
my_autocomplete_function = function(x) {
print(
"
seq.Date(as.Date(Sys.Date()),as.Date(Sys.Date()+1),by = 'days') %>%
pbapply::pblapply(function(x){
fst::read.fst(list.files(as.character(x), as.data.table = T)) %>%
#begin filtering and grouping by below custom
group_by()
}) %>% rbindlist()
")
}
#I can just print the function and copy paste the text from the output in my console to my script
my_autocomplete_function()
#but I rather it just autocomplete and appear in the script if possible?
Putting text into the command line will probably be a function of the interface you are using to run R - is it plain R, Rstudio, etc?
One possibility might be to use the clipr package and put the code into the clipboard, then prompt the user to hit their "paste" button to get it on the command line. For example this function which creates a little code string:
> writecode = function(x){
code = paste0("print(",x,")")
clipr::write_clip(code)
message("Code ready to paste")}
Use it like this:
> writecode("foo")
Code ready to paste
Then when I hit Ctrl-V to paste, I see this:
> print(foo)
I can then edit that line. Any string will do:
> writecode("bar")
Code ready to paste
[ctrl-V]
> print(bar)
Its one extra key for your user to press, but having a chunk of code appear on the command line with no prompting might be quite surprising for a user.
I spend my day reading the source code for utils auto completion. The linebuffer only contains the code for ... one line, so that can't do fully what your looking for here, but it is quite customizable. Maybe the rstudio source code for autocompletion has more answers. Here is an example to write and activate your own custom auto completer. It is not well suited for packages as any new pacakge could overwrite the settings.
Below a
#load this function into custom.completer setting to activate
rc.options("custom.completer" = function(x) {
#perhaps debug it, it will kill your rsession sometimes
#browser()
###default completion###
##activating custom deactivates anything else
#however you can run utils auto completer also like this
#rstudio auto completation is not entirely the same as utils
f = rc.getOption("custom.completer")
rc.options("custom.completer" = NULL)
#function running base auto complete.
#It will dump suggestion into mutable .CompletionEnv$comps
utils:::.completeToken() #inspect this function to learn more about completion
rc.options("custom.completer" = f)
###your custom part###
##pull this environment holding all input/output needed
.CompletionEnv = utils:::.CompletionEnv
#perhaps read the 'token'. Also 'linebuffer', 'start' & 'end' are also useful
token = .CompletionEnv$token
#generate a new completion or multiple...
your_comps = paste0(token,c("$with_tomato_sauce","$with_apple_sauce"))
#append your suggestions to the vanilla suggestions/completions
.CompletionEnv$comps = c(your_comps,.CompletionEnv$comps)
print(.CompletionEnv$comps)
#no return used
NULL
})
xxx<tab>
NB! currently rstudio IDE will backtick quote any suggestion which is not generated by them. I want to raise an issue on that someday.
Bonus info: A .DollarNames.mys3class-method can be very useful and works with both rstudio and utils out of the box e.g.
#' #export
.DollarNames.mys3class = function(x, pattern = "") {
#x is the .CompletionEnv
c("apple","tomato")
}
I'm using twitteR to get the followers for a few handles. When fetching a single user, this code works:
test <- getUser("BarackObama")
test_friends <- test$getFriends(10) %>%
twListToDF() %>%
tibble::rownames_to_column() %>%
mutate(id = rowname) %>%
select(name, everything())
However, I'm not sure what's the cleanest way to iterate over a list of handles. The main obstacle I see at the moment is that I don't know how to pipe/vectorize over the getFriends() method (contra a getFriends() function). Plus, the object returned by getFriends() is not a DF, but has to be flattened (?) by twListToDF(), to then use rbind().
For looping, this is as far as I got:
handles <- c("BarackObama", "ThePresObama")
while (i < length(handles)) {
user <- getUser(handles[i])
friends <- user$getFriends() %>%
twListToDF()
}
With a bit more tinkering, I think I could get this to work, but I'm not sure if it's the best approach.
Alternatively, using rtweet, it seems like there is a more elegant solution that might accomplish your goal. It extracts the followers of specified users into a dataframe, looks up followers by user, then binds that result to the original dataframe using left_join so that you can distinguish which users correspond to which followers.
library(rtweet)
handles <- c("BarackObama", "ThePresObama")
handles.friends <- get_friends(handles)
handles.data <- lookup_users(handles.friends$user_id) %>%
left_join(handles.friends)
The pmap_* functions from purrr might also help implement a solution using the twitteR library, and have generally helped me to implement non-vectorized functions, but unfortunately I'm unable to get twitteR authentication working.
I was looking for a way to make spark_write_csv to upload only a single file to S3 because I want to save the regression result on S3. I was wondering if options has some parameter which defines number of partitions. I could not find it anywhere in the documentation. Or is there any other efficient way to upload resultant table to S3?
Any help is appreciated!
options argument is equivalent options call on the DataFrameWriter (you can check DataFrameWriter.csv documentation for a full list of options specific to CSV source) and it cannot be used to control the number of the output partitions.
While in general it is not recommended, you can use Spark API to coalesce data and convert it back to sparklyr tbl:
df %>%
spark_dataframe() %>%
invoke("coalesce", 1L) %>%
invoke("createOrReplaceTempView", "_coalesced")
tbl(sc, "_coalesced") %>% spark_write_csv(...)
or, in the recent versions, sparklyr::sdf_coalesce
df %>% sparklyr::sdf_coalesce()
Is there an equivalent to the unix less command that can be used within the R console?
There is also page() which displays a representation of an object in a pager, like less.
dat <- data.frame(matrix(rnorm(1000), ncol = 10))
page(dat, method = "print")
Not really. There are the commands
head() and tail() for showing the beginning and end of objects
print() for explicitly showing an object, and just its name followed by return does the same
summary() for concise summary that depends on the object
str() for its structure
and more. An equivalent for less would be a little orthogonal to the language and system. Where the Unix shell offers you less to view the content of a file (which is presumed to be ascii-encoded), it cannot know about all types.
R is different in that it knows about the object types which is why summary() -- as well as the whole modeling framework -- are more appropriate.
Follow-up edit: Another possibility is provided by edit() as well as edit.data.frame().
I save the print output to a file and then read it using an editor or less.
Type the following in R
sink("Routput.txt")
print(varname)
sink()
Then in a shell:
less Routput.txt
If the file is already on disk, then you can use file.show
You might like my little toy here:
short <- function(x=seq(1,20),numel=4,skipel=0,ynam=deparse(substitute(x))) {
ynam<-as.character(ynam)
#clean up spaces
ynam<-gsub(" ","",ynam)
#unlist goes by columns, so transpose to get what's expected
if(is.list(x)) x<-unlist(t(x))
if(2*numel >= length(x)) {
print(x)
}
else {
frist=1+skipel
last=numel+skipel
cat(paste(ynam,'[',frist,'] thru ',ynam,'[',last,']\n',sep=""))
print(x[frist:last])
cat(' ... \n')
cat(paste(ynam,'[',length(x)-numel-skipel+1,'] thru ', ynam, '[', length(x)-skipel,']\n',sep=""))
print(x[(length(x)-numel-skipel+1):(length(x)-skipel)])
}
}
blahblah copyright by me, not Disney blahblah free for use, reuse, editing, sprinkling on your Wheaties, etc.