How to unpersist in sparklyr?

I am using sparklyr for a project and have understood that persisting is very useful. I am using sdf_persist for this, with the following syntax (correct me if I am wrong):
data_frame <- sdf_persist(data_frame)
Now I am reaching a point where I have too many RDDs stored in memory, so I need to unpersist some. However, I cannot seem to find a function to do this in sparklyr. Note that I have tried:
dplyr::db_drop_table(sc, "data_frame")
dplyr::db_drop_table(sc, data_frame)
unpersist(data_frame)
sdf_unpersist(data_frame)
But none of those work.
Also, I am trying to avoid using tbl_cache (in which case it seems that db_drop_table works), as sdf_persist seems to offer more control over the storage level. It might be that I am missing the big picture of how to use persistence here, in which case I'll be happy to learn more.

If you don't care about granularity then the simplest solution is to invoke Catalog.clearCache:
spark_session(sc) %>% invoke("catalog") %>% invoke("clearCache")
Uncaching a specific object is much less straightforward because of sparklyr's indirection. If you check the object returned by sdf_persist, you'll see that the persisted table is not exposed directly:
df <- copy_to(sc, iris, memory=FALSE, overwrite=TRUE) %>% sdf_persist()
spark_dataframe(df) %>%
  invoke("storageLevel") %>%
  invoke("equals", invoke_static(sc, "org.apache.spark.storage.StorageLevel", "NONE"))
[1] TRUE
That's because you don't get the registered table directly, but rather the result of a subquery like SELECT * FROM ....
It means you cannot simply call unpersist:
spark_dataframe(df) %>% invoke("unpersist")
as you would in one of the official APIs.
Instead, you can try to retrieve the name of the source table, for example like this:
src_name <- as.character(df$ops$x)
and then invoke Catalog.uncacheTable:
spark_session(sc) %>% invoke("catalog") %>% invoke("uncacheTable", src_name)
That is likely not the most robust solution, so please use with caution.
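Putting the pieces together, a small helper along these lines can work. It is only a sketch: reading df$ops$x relies on sparklyr internals and may break between versions.
# Sketch: uncache the table backing a persisted sparklyr tbl.
# Assumes `df` was created as above (copy_to() + sdf_persist()), so the
# registered table name can be recovered from df$ops$x (not a stable API).
uncache_tbl <- function(sc, df) {
  src_name <- as.character(df$ops$x)
  spark_session(sc) %>%
    invoke("catalog") %>%
    invoke("uncacheTable", src_name)
  invisible(src_name)
}

# Usage: uncache_tbl(sc, df)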

Related

why "separate" and "unite" function don´t not work in dplyr

I used the functions separate and unite to clean some data, but they don't seem to work.
I've been trying to split a string column into two columns using dplyr. The function is quite easy to use and I don't know why it does not work.
The variable (column) I want to separate is season which contains values of “MAD_S1, KGA_S1” etc. (thousands of records, but there are 6 categories, all separated by the “_S1”; raw data has been inspected and all follow the same syntax). Therefore, I applied
separate(six_sites_spp, season, c("code_loc","season1"), sep = "_")
I have tried a more explicit call, such as:
separate(six_sites_spp,
         col = "season",
         into = c("code_loc", "season1"),
         sep = "_")
but that did not work either.
I have updated dplyr and tried several things. If I use unite instead to merge two columns, it does not work either. I worked around the merging with the classic paste function, but not the splitting; I do, however, want to know why dplyr does not work (this is a great package, and for some reason other commands are not working either).
Would anyone be able to provide feedback on this, please? Is it a possible "bug" or something within my system (Windows 10, HP Envy)? Do I need another package loaded at the same time (I also use tidyr in the same script)? Is there a version mismatch (my R version is 3.5.1 (2018-07-02))? When I run the code it does something internally, as I can see it running the commands, but the output is the same data frame (i.e. no new variables code_loc, season1).
Many thanks in advance.
*there are no error messages
Since you mention there is no error message, I assume the function works properly but you simply aren't saving the output.
A dplyr pipeline usually flows like this:
library(dplyr)
library(tidyr) # separate() comes from tidyr

six_sites_spp %>%
  separate(season, c("code_loc", "season1"), sep = "_") %>%
  {.} -> six_sites_spp # This saves the changed data frame under the old name
Alternatively, this works as well:
six_sites_spp <- separate(six_sites_spp, season, c("code_loc", "season1"), sep = "_")
Naturally you could also save the changed data frame under a new name to preserve the original data.
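For completeness, here is a self-contained sketch with hypothetical values in the same "XXX_S1" shape as the question's season column; the key point is that the result is assigned back:
library(dplyr)
library(tidyr)  # separate() comes from tidyr

# Hypothetical sample data shaped like the question
six_sites_spp <- tibble(season = c("MAD_S1", "KGA_S1"), n = c(10, 3))

six_sites_spp <- six_sites_spp %>%
  separate(season, into = c("code_loc", "season1"), sep = "_")

six_sites_spp
#> # A tibble: 2 x 3
#>   code_loc season1     n
#>   <chr>    <chr>   <dbl>
#> 1 MAD      S1         10
#> 2 KGA      S1          3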

Can the use of return() after a magrittr pipeline to write an object break something?

I often have to manually clean up entries in datasets.
For flexibility and readability, I like to use pipes.
Sometimes I later come across another thing I need to clean up, so I keep a copy-paste'able line in my magrittr pipeline.
In ggplot2, calling an empty theme() at the end helps me to keep coding flexible for later additions. I never encountered problems with that and thought I could do the same with return() in a pipeline.
I think return() is not meant to be used outside functions, so: is there a conceivable way this could break my code?
I also stumbled upon {.} as an alternative, but I don't really know what it does, and searching for information (even using advanced search on SO) is not helping.
Example:
starwars %<>%
  mutate(hair_color = ifelse(name == "Captain Phasma", "blond", hair_color)) %>%
  mutate(skin_color = ifelse(name == "Captain Phasma", "fair", skin_color)) %>%
  mutate(hair_color = ifelse(name == "Zam Wesell", "blond", hair_color)) %>%
  #mutate(var = ifelse(name == "cond", "replacement", var)) %>% ### for future c/p
  return() #
NB: I realise this is borderline the "coding-style" tag, so I'd like to point out that I'm not interested in an opinion-based discussion, but in advice on whether this could break my code in certain circumstances. Examples where it does break code are welcome, as are alternative suggestions.
I think these topics/threads are related:
https://github.com/tidyverse/magrittr/issues/32
Use $ dollar sign at end of an R magrittr pipeline to return a vector
how to feed the result of a pipe chain (magrittr) to an object
As put forward by @Roland and @RolandASc: identity() does what I wanted. I have been using it since, and haven't encountered any surprises so far.
Further related discussion can be found over at the RStudio community.
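For illustration, the same pipeline with identity() as the trailing no-op (just a sketch of the suggestion above):
library(dplyr)
library(magrittr)

starwars %<>%
  mutate(hair_color = ifelse(name == "Captain Phasma", "blond", hair_color)) %>%
  mutate(skin_color = ifelse(name == "Captain Phasma", "fair", skin_color)) %>%
  #mutate(var = ifelse(name == "cond", "replacement", var)) %>% ### for future c/p
  identity()  # returns its input unchanged, so the trailing call is harmless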

How to vectorize/pipe through methods?

I'm using twitteR to get the followers for a few handles. When fetching a single user, this code works:
test <- getUser("BarackObama")
test_friends <- test$getFriends(10) %>%
  twListToDF() %>%
  tibble::rownames_to_column() %>%
  mutate(id = rowname) %>%
  select(name, everything())
However, I'm not sure what's the cleanest way to iterate over a list of handles. The main obstacle I see at the moment is that I don't know how to pipe/vectorize over the getFriends() method (contra a getFriends() function). Plus, the object returned by getFriends() is not a DF, but has to be flattened (?) by twListToDF(), to then use rbind().
For looping, this is as far as I got:
handles <- c("BarackObama", "ThePresObama")
while (i < length(handles)) {
  user <- getUser(handles[i])
  friends <- user$getFriends() %>%
    twListToDF()
}
With a bit more tinkering, I think I could get this to work, but I'm not sure if it's the best approach.
Alternatively, using rtweet, there is a more elegant solution that might accomplish your goal. It extracts the friends of the specified users (the accounts each handle follows) into a data frame, looks up those friends as users, and then joins that result back to the friends data frame using left_join, so that you can tell which friends correspond to which handle.
library(rtweet)
handles <- c("BarackObama", "ThePresObama")
handles.friends <- get_friends(handles)
handles.data <- lookup_users(handles.friends$user_id) %>%
  left_join(handles.friends)
The pmap_* functions from purrr might also help implement a solution using the twitteR library, and have generally helped me to implement non-vectorized functions, but unfortunately I'm unable to get twitteR authentication working.
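For what it's worth, a map-based version of the twitteR loop might look like the sketch below. It is untested here (for the authentication reason above), and the added handle column is just bookkeeping, not part of the twitteR API:
library(dplyr)
library(purrr)
library(twitteR)

handles <- c("BarackObama", "ThePresObama")

# One data frame of friends per handle, row-bound together
friends_df <- map_dfr(handles, function(h) {
  getUser(h)$getFriends(10) %>%
    twListToDF() %>%
    mutate(handle = h)  # remember which handle these friends belong to
})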

Fastest way to get minimum data from Database

I have a Postgres database. I want to find the minimum value of a column called calendarid, which is of type integer and the format yyyymmdd, from a certain table. I am able to do so via the following code.
get_history_startdate <- function(src) {
  get_required_table(src) %>% # This gives me the table: tbl(src, "table_name")
    select(calendarid) %>%
    as_data_frame %>%
    collect() %>%
    min() # Result: 20150131
}
But this method is really slow, as it loads all the data from the database into memory. Any ideas how I can improve it?
get_required_table(src) %>%
  summarise(min(calendarid, na.rm = TRUE)) %>%
  pull()
will run the appropriate SQL query.
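To confirm that the aggregation is pushed down to Postgres rather than computed in R, you can inspect the translated query with show_query(); the output shown below is approximate and depends on your dbplyr version:
get_required_table(src) %>%
  summarise(min_calendarid = min(calendarid, na.rm = TRUE)) %>%
  show_query()
#> <SQL>
#> SELECT MIN("calendarid") AS "min_calendarid"
#> FROM "table_name"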
If you just want the minimum value of the calendarid column across the entire table, then use this:
SELECT MIN(calendarid) AS min_calendarid
FROM your_table;
I don't know exactly what your R code is doing under the hood, but if it's bringing the entire table from Postgres into R, then it is very wasteful. If so, running the above query directly on Postgres should give you a boost in performance.
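If you'd rather bypass dplyr for this one value, the query can be sent straight from R. A minimal sketch, assuming src wraps a DBI connection (so src$con is the connection object) and that the table is called your_table as above:
library(DBI)

min_calendarid <- dbGetQuery(
  src$con,  # assumes src is a dbplyr source wrapping a DBI connection
  "SELECT MIN(calendarid) AS min_calendarid FROM your_table"
)$min_calendarid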

What is the options parameter of the sparklyr spark_write_csv function?

I was looking for a way to make spark_write_csv upload only a single file to S3, because I want to save a regression result there. I was wondering if options has some parameter which defines the number of partitions. I could not find it anywhere in the documentation. Or is there any other efficient way to upload the resulting table to S3?
Any help is appreciated!
The options argument is equivalent to the options call on the DataFrameWriter (you can check the DataFrameWriter.csv documentation for a full list of options specific to the CSV source), and it cannot be used to control the number of output partitions.
While in general it is not recommended, you can use the Spark API to coalesce the data and convert it back to a sparklyr tbl:
df %>%
  spark_dataframe() %>%
  invoke("coalesce", 1L) %>%
  invoke("createOrReplaceTempView", "_coalesced")

tbl(sc, "_coalesced") %>% spark_write_csv(...)
or, in recent versions, sparklyr::sdf_coalesce:
df %>% sparklyr::sdf_coalesce(partitions = 1)
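For example, combined with the write step it could look like the sketch below; the S3 path and the header/mode settings are placeholders rather than part of the original answer, and Spark will still write a directory containing a single part-*.csv file rather than one standalone file:
library(sparklyr)

df %>%
  sparklyr::sdf_coalesce(partitions = 1) %>%  # one partition -> one part file
  spark_write_csv(
    path = "s3a://your-bucket/regression-result",  # hypothetical S3 path
    header = TRUE,
    mode = "overwrite"
  )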
