I was looking for a way to make spark_write_csv upload only a single file to S3, because I want to save a regression result there. I was wondering whether options has a parameter that controls the number of partitions, but I could not find one anywhere in the documentation. Or is there any other efficient way to upload the resulting table to S3?
Any help is appreciated!
The options argument is equivalent to the options call on the DataFrameWriter (you can check the DataFrameWriter.csv documentation for a full list of options specific to the CSV source), and it cannot be used to control the number of output partitions.
While it is not recommended in general, you can use the Spark API to coalesce the data and convert the result back to a sparklyr tbl:
df %>%
  spark_dataframe() %>%
  invoke("coalesce", 1L) %>%
  invoke("createOrReplaceTempView", "_coalesced")

tbl(sc, "_coalesced") %>% spark_write_csv(...)
or, in recent versions, sparklyr::sdf_coalesce:

df %>% sparklyr::sdf_coalesce(partitions = 1)
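Putting it together, a minimal sketch of the end-to-end write (the dataset name and S3 path are placeholders; note that Spark still writes a directory containing a single part-* file rather than a bare CSV file):

library(sparklyr)
library(dplyr)

# regression_result is a placeholder for your sparklyr tbl of results
regression_result %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv("s3a://my-bucket/regression-result/", mode = "overwrite")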
I am doing some text analysis in R Studio, and as part of this analysis I have split my data frame into various tibbles, split by a column in my data called "topic". This has worked successfully.
All I need to do now is find some way to export each of those tibbles into a csv, xlsx or even html file - anything that will let me look through them properly.
Has anyone got any solutions for this? It feels like it should be easy to do, but my research has not turned up anything.
[Screenshot of the tibbles I am trying to export]
Thanks
You may use map or lapply to write each dataframe to a csv. However, group_split does not give names to the list. To get proper names for the csv files you can use split and imap together.
For example, with the iris dataset:
library(tidyverse)

iris %>%
  split(.$Species) %>%
  imap(~ write_csv(.x, paste0(.y, '.csv')))
This creates three csvs named setosa.csv, versicolor.csv and virginica.csv in the working directory.
If you don't mind using a for() loop, try this. You could probably find a better way to name the list items (see the sketch after the loop), but this works.
my_list <- mtcars %>%
  group_split(gear)

names(my_list) <- 1:length(my_list)

for (i in 1:length(my_list)) {
  write_csv(my_list[[i]], file = paste0(names(my_list)[i], ".csv"))
}
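As one sketch of a more descriptive naming scheme (same mtcars example), group_keys() returns the grouping values in the same order that group_split() returns the pieces, so it can supply the file names:

library(dplyr)
library(readr)
library(purrr)

by_gear <- mtcars %>% group_by(gear)

my_list <- by_gear %>% group_split()
names(my_list) <- by_gear %>%
  group_keys() %>%
  pull(gear) %>%
  paste0("gear_", .)

# writes gear_3.csv, gear_4.csv and gear_5.csv
iwalk(my_list, ~ write_csv(.x, paste0(.y, ".csv")))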
I am using Sparklyr for a project and have understood that persisting is very useful. I am using sdf_persist for this, with the following syntax (correct me if I am wrong):
data_frame <- sdf_persist(data_frame)
Now I am reaching a point where I have too many RDDs stored in memory, so I need to unpersist some of them. However, I cannot seem to find the function to do this in Sparklyr. Note that I have tried:
dplyr::db_drop_table(sc, "data_frame")
dplyr::db_drop_table(sc, data_frame)
unpersist(data_frame)
sdf_unpersist(data_frame)
But none of those work.
Also, I am trying to avoid using tbl_cache (in which case it seems that db_drop_table works), as sdf_persist appears to offer more flexibility in choosing the storage level. It might be that I am missing the big picture of how persistence should be used here, in which case I'll be happy to learn more.
If you don't care about granularity then the simplest solution is to invoke Catalog.clearCache:
spark_session(sc) %>% invoke("catalog") %>% invoke("clearCache")
Uncaching a specific object is much less straightforward because of sparklyr's indirection. If you check the object returned by sdf_persist, you'll see that the persisted table is not exposed directly:
df <- copy_to(sc, iris, memory = FALSE, overwrite = TRUE) %>% sdf_persist()

spark_dataframe(df) %>%
  invoke("storageLevel") %>%
  invoke("equals", invoke_static(sc, "org.apache.spark.storage.StorageLevel", "NONE"))
[1] TRUE
That's because you don't get the registered table directly, but rather the result of a subquery like SELECT * FROM ....
It means you cannot simply call unpersist:
spark_dataframe(df) %>% invoke("unpersist")
as you would in one of the official APIs.
Instead, you can try to retrieve the name of the source table, for example like this:
src_name <- as.character(df$ops$x)
and then invoke Catalog.uncacheTable:
spark_session(sc) %>% invoke("catalog") %>% invoke("uncacheTable", src_name)
That is likely not the most robust solution, so please use with caution.
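For convenience, those two steps can be wrapped in a small helper; this is just a sketch built on the same assumption that df$ops$x holds the name of the registered source table:

uncache_tbl <- function(sc, df) {
  src_name <- as.character(df$ops$x)  # assumed to be the registered table name
  spark_session(sc) %>%
    invoke("catalog") %>%
    invoke("uncacheTable", src_name)
  invisible(src_name)
}

uncache_tbl(sc, df)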
I'm using twitteR to get the followers for a few handles. When fetching a single user, this code works:
test <- getUser("BarackObama")
test_friends <- test$getFriends(10) %>%
  twListToDF() %>%
  tibble::rownames_to_column() %>%
  mutate(id = rowname) %>%
  select(name, everything())
However, I'm not sure what the cleanest way is to iterate over a list of handles. The main obstacle I see at the moment is that I don't know how to pipe/vectorize over the getFriends() method (as opposed to a getFriends() function). Plus, the object returned by getFriends() is not a data frame, but has to be flattened by twListToDF() before the results can be combined with rbind().
For looping, this is as far as I got:
handles <- c("BarackObama", "ThePresObama")

friends <- list()
for (i in seq_along(handles)) {
  user <- getUser(handles[i])
  friends[[handles[i]]] <- user$getFriends() %>%
    twListToDF()
}
With a bit more tinkering, I think I could get this to work, but I'm not sure if it's the best approach.
Alternatively, rtweet seems to offer a more elegant solution that might accomplish your goal. It extracts the friends of the specified users into a data frame, looks up those accounts by user id, then joins the result back to the original data frame using left_join so that you can distinguish which friends correspond to which handle.
library(rtweet)

handles <- c("BarackObama", "ThePresObama")
handles.friends <- get_friends(handles)
handles.data <- lookup_users(handles.friends$user_id) %>%
  left_join(handles.friends)
The pmap_* functions from purrr might also help implement a solution using the twitteR library, and they have generally helped me implement non-vectorized functions, but unfortunately I was unable to get twitteR authentication working.
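For reference, here is a rough, untested sketch (for the authentication reason just mentioned) of what a purrr-based twitteR approach could look like; the limit of 10 friends simply mirrors the example above:

library(twitteR)
library(purrr)
library(dplyr)

handles <- c("BarackObama", "ThePresObama")

friends_df <- handles %>%
  set_names() %>%
  map(~ getUser(.x)$getFriends(10)) %>%  # list of twitteR user objects per handle
  map(twListToDF) %>%                    # flatten each list into a data frame
  bind_rows(.id = "handle")              # record which handle each row came from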
I have a Postgres database. I want to find the minimum value of a column called calendarid, which is of type integer and has the format yyyymmdd, in a certain table. I am able to do so with the following code:
get_history_startdate <- function(src) {
  get_required_table(src) %>%  # This gives me the table tbl(src, "table_name")
    select(calendarid) %>%
    as_data_frame() %>%
    collect() %>%
    min()  # Result: 20150131
}
But this method is really slow because it loads all of the data from the database into memory. Any ideas how I can improve it?
get_required_table(src) %>%
  summarise(min_calendarid = min(calendarid, na.rm = TRUE)) %>%
  pull()
will run the appropriate SQL query.
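If you want to verify the translation before running it, show_query() prints the generated SQL without pulling any data (this assumes get_required_table() returns a dplyr tbl backed by your Postgres connection):

get_required_table(src) %>%
  summarise(min_calendarid = min(calendarid, na.rm = TRUE)) %>%
  show_query()
# expected to be roughly: SELECT MIN("calendarid") AS "min_calendarid" FROM "table_name"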
If you just want the minimum value of the calendarid column across the entire table, then use this:
SELECT MIN(calendarid) AS min_calendarid
FROM your_table;
I don't know exactly what your R code is doing under the hood, but if it's bringing the entire table from Postgres into R, that is very wasteful. If so, running the above query directly on Postgres should give you a boost in performance.
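If you would rather skip the dplyr translation entirely, here is a minimal sketch using DBI, assuming con is a DBI connection to the Postgres database and your_table is the real table name:

library(DBI)

min_calendarid <- dbGetQuery(
  con,
  "SELECT MIN(calendarid) AS min_calendarid FROM your_table"
)$min_calendarid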
I am trying to use invoke in RStudio's sparklyr to do a simple word count on a text file in HDFS and have not figured out the syntax. I can get the whole file back as a list by using the following (similar to the count example in the sparklyr doc on extensions - http://spark.rstudio.com/extensions.html):
getFileCollect <- function(sc, path) {
  spark_context(sc) %>%
    invoke("textFile", path, 1L) %>%
    invoke("collect")
}
fc <- getFileCollect(sc, "hdfs:///tmp/largeTomes/bigEx.txt")
What I want to do is a flatMap on that text file, to reproduce the classic Scala code:
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
But I haven't even come close to figuring out the syntax of invoke. flatMap is a method on the RDD returned by textFile. Surely somebody has done this before and I'm just not thinking about it right.
Thank you!
Maybe you figured it out already (it has been a few months), but to use collect with dplyr, you just have to write it this way:
myFileCollected <- myFileDF %>% collect
Then I would try to use dplyr functions, e.g. mutate (that's one of sparklyr's advantages).
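Building on that, here is a rough sketch of a word count done entirely through sparklyr's dplyr interface rather than invoke; it assumes spark_read_text for reading the file (which yields a one-column table named line) and relies on sparklyr passing split() and explode() through to Spark SQL:

library(sparklyr)
library(dplyr)

lines_tbl <- spark_read_text(sc, "big_ex", "hdfs:///tmp/largeTomes/bigEx.txt")

word_counts <- lines_tbl %>%
  # split() and explode() are Spark SQL functions passed through untranslated
  mutate(word = explode(split(line, " "))) %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

word_counts %>% head(20) %>% collect()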