How exactly do UDFs work in SparkR?

Let's say I defined an R function that takes two numerics as inputs:
effectifTouche <- function(audience, extrapolated){
  TM = audience / 1000000
  VE = extrapolated / 100
  TME = TM * VE
  nbVis = TME / 1000000.1
  return(nbVis)
}
It gives me back a score, so I would like to use it as a UDF on two columns of a SparkR DataFrame.
This was working in PySpark, and I was wondering how SparkR handles it.
So I tried many things in both sparklyr and SparkR, but I can't get this UDF working.
Ideally, I would love to just do this :
df %>%
  dapply(df_join,
         function(p) {
           effectifTouche(p$audience, p$extrapolated)
         })
effectifTouche being my R function, and audience and extrapolated being my two columns of the Spark DataFrame.
I will gladly take answers for either library, SparkR or sparklyr, because I tried both and checked every related GitHub issue with no success.
Thanks a lot
Edit: another tricky use case
df %>%
  mutate(my_var = as.numeric(strptime(endHour, format = "%H:%M:%S"), unit = "secs"))

With simple arithmetic like this, you're probably better off pushing the computation to Spark SQL, e.g.
df %>%
  mutate(TM = audience / 1000000,
         VE = extrapolated / 100,
         TME = TM * VE,
         nbVis = TME / 1000000.1)
If you actually need to use external R packages, we can help you better if you provide an example of df.
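If you do want to run your R function as an actual UDF, here is a rough, untested sketch of what it could look like with SparkR's dapply. It assumes df is a SparkR DataFrame containing the audience and extrapolated columns, that effectifTouche is defined in the same session (SparkR ships the closure to the workers), and it spells out the output schema that dapply requires:
library(SparkR)

# Output schema is mandatory for dapply(); names and types are assumed from the question.
schema <- structType(
  structField("audience", "double"),
  structField("extrapolated", "double"),
  structField("nbVis", "double")
)

result <- dapply(df, function(p) {
  # p is a plain R data.frame holding one partition
  data.frame(audience = p$audience,
             extrapolated = p$extrapolated,
             nbVis = effectifTouche(p$audience, p$extrapolated))
}, schema)

head(collect(result))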

Related

ENP function in mutate

Currently, I am cleaning my dataset (Comparative Manifesto Project) and trying to compute the effective number of parties using the enp function from the electoral package (https://www.rdocumentation.org/packages/electoral/versions/0.1.2/topics/enp). However, I am running into some issues.
When I run this code:
cmp_1990 %>%
  mutate(enp_vote = round(pervote, digits = 2)) %>%
  mutate(enp_vote = as.numeric(enp_vote)) %>%
  relocate(enp_vote, .before = parfam) %>%
  mutate(enp_vote = enp(votes = cmp_1990$enp_vote)) %>%
  relocate(enp, .before = parfam)
I get the error message:
Error: Can't subset columns that don't exist.
x Column `enp` doesn't exist.
I suppose R treats the function enp as a single column, even though I have installed the package and loaded it with library().
I tried it with differently rounded numbers and by calling enp outside of the rest of the pipeline, but so far nothing has worked. Oh, and the cmp_1990$enp_vote reference was necessary because otherwise the enp function treated enp_vote as a categorical rather than a numerical value.
Sorry by the way if my code doesn't look the nicest, it's my first time using R, haha.
Thanks very much in advance!
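For what it's worth, the final relocate() fails because enp is the function from the electoral package, not a column in your data, so there is nothing called enp to move. A minimal sketch of computing the index inside a grouped mutate instead (the grouping columns country and edate are assumptions on my part; use whatever identifies one election in your data):
library(dplyr)
library(electoral)  # provides enp()

cmp_1990 %>%
  mutate(enp_vote = as.numeric(round(pervote, digits = 2))) %>%
  group_by(country, edate) %>%                   # assumed grouping: one election per group
  mutate(enp_value = enp(votes = enp_vote)) %>%  # enp() now only sees the group's votes
  ungroup() %>%
  relocate(enp_value, .before = parfam)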

Running parallel function calls with sparklyr

Currently, I am using a foreach loop from the doParallel library to run function calls in parallel across multiple cores of the same machine, which looks something like this:
out_results <- foreach(i = 1:length(some_list)) %dopar% {
  out <- function_call(some_list[[i]])
  return(out)
}
Here some_list is a list of data frames, and each data frame can have a different number of columns. function_call() is a function that does multiple things to the data, such as data manipulation, then uses a random forest for variable selection, and finally performs a least-squares fit. The variable out is again a list of 3 data frames, so out_results will be a list of lists.
I am using CRAN packages and some custom packages of my own inside the function call. I want to avoid using the Spark ML libraries because of their limited functionality and the rewriting of the entire code this would require.
I want to leverage Spark to run these function calls in parallel. Is it possible to do so? If yes, in which direction should I be thinking? I have read a lot of the sparklyr documentation, but it doesn't seem to help much since the examples provided there are very basic.
sparklyr's homepage gives examples of arbitrary R code distributed on the Spark cluster. In particular, see its example with grouped operations.
Your main structure should be a data frame, which you will process rowwise. Probably something like the following (not tested):
library(dplyr)
library(tibble)
library(sparklyr)

some_list <- list(tibble(a = 1[0]), tibble(b = 1), tibble(c = 1:2))
all_data  <- tibble(i = seq_along(some_list), df = some_list)

# Replace this with your actual code.
# It should take one data frame and produce one data frame.
# Embedded data frame columns are OK.
transform_one <- function(df_wrapped) {
  # In your example, you expect only one record per group.
  stopifnot(nrow(df_wrapped) == 1)
  df <- df_wrapped$df
  res0 <- df
  res1 <- tibble(x = 10)
  res2 <- tibble(y = 10:11)
  return(tibble(res0 = list(res0), res1 = list(res1), res2 = list(res2)))
}

# Note: spark_apply() expects a Spark DataFrame, so all_data would first
# need to be moved into Spark (e.g. with copy_to() on an open connection).
all_data %>% spark_apply(
  transform_one,
  group_by = c("i"),
  columns = c("res0" = "list", "res1" = "list", "res2" = "list"),
  packages = c("randomForest", "etc")
)
All in all, this approach seems unnatural, as if we were forcing the use of Spark on a task which does not really fit. Maybe you should check for another parallelization framework?
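One such framework, named here purely as an illustration and not part of the original answer, is future together with future.apply: it keeps the existing list-of-data-frames workflow intact and can target local cores or several machines without involving Spark.
library(future)
library(future.apply)

# 'node1' and 'node2' are placeholder hostnames; plan(multisession) would use local cores instead.
plan(cluster, workers = c("node1", "node2"))

out_results <- future_lapply(some_list, function_call)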

R code optimization: For loop and writing to a database

I am trying to optimize a simple piece of R code I wrote, in two respects:
1) For loops
2) Writing data into my PostgreSQL database
For 1), I know for loops should be avoided at all costs and it's recommended to use lapply, but I am not clear on how to translate my code below using lapply.
For 2), what I do below works, but I am not sure it is the most efficient way (for example, doing it this way versus rbinding all data into an R data frame and then loading the whole data frame into my PostgreSQL database).
EDIT: I updated my code with a reproducible example below.
library(rvest)
library(DBI)

for (i in 1:100) {
  search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
  download.file(search, destfile = 'scrape.html', quiet = TRUE)
  url <- read_html('scrape.html')
  github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
  github_link  <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
  df <- data.frame(github_title, github_link)
  colnames(df) <- c("title", "link")
  dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
  cat(i)
}
Thanks a lot for all your inputs!
First of all, it is a myth that should be thoroughly debunked that lapply is in any way faster than equivalent code using a for loop. This has not been the case for years, and a for loop should in every case be at least as fast as the equivalent lapply.
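If you want to check this on your own machine, a quick comparison could look like the following (microbenchmark is an extra package used here only for illustration; f stands in for any per-element function):
library(microbenchmark)

x <- 1:1e4
f <- function(v) v^2

microbenchmark(
  for_loop = {
    out <- vector("list", length(x))
    for (i in seq_along(x)) out[[i]] <- f(x[i])
  },
  lapply = lapply(x, f),
  times = 20
)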
I will illustrate this using a for loop, as you seem to find that more intuitive. Do note, however, that I work mostly in T-SQL, so some conversion might be necessary.
n <- 1e5  # however many iterations you actually need
outputDat <- vector('list', n)

for (i in seq_len(n)) {
  # element_a ... element_d stand in for your scraped vectors
  id <- element_a[i]
  location <- element_b[i]
  language <- element_c[i]
  date_creation <- element_d[i]
  df <- data.frame(id, location, language, date_creation)
  colnames(df) <- c("id", "location", "language", "date_creation")
  outputDat[[i]] <- df
}

## Combine the data.frames
outputDat <- do.call('rbind', outputDat)

## Write the combined data.frame into the database.
##dbBegin(con)  #<= might speed it up, might not.
dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
##dbCommit(con) #<= might speed it up, might not.
Using Transact-SQL you could alternatively combine the entire string into a single INSERT INTO statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is once again just as fast if done properly.
## Create the statement here.
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.

## Optional: print a bit of the statement
# cat(substr(statement, 1, 2000))

statement <- paste0("
/*
SET NOCOUNT ON seems to be necessary in the DBI API.
It seems to react to 'n rows affected' messages.
Note: this only affects this method, not the one using dbWriteTable.
*/
--SET NOCOUNT ON
INSERT INTO [my table] VALUES ", statement)

##dbBegin(con)  #<= might speed it up, might not.
dbExecute(con, statement)
##dbCommit(con) #<= might speed it up, might not.
Note, as I mention in the comments, this might simply fail to upload the table properly, as the DBI package sometimes seems to fail on this kind of transaction if it results in one or more messages about n rows affected.
Last but not least, once the statement is built, it can be copied and pasted from R into any GUI that accesses the database directly, for example using writeLines(statement, 'clipboard') or by writing it to a text file (a file is more stable if your data contains a lot of rows). In rare cases this last resort can be faster, if for whatever reason DBI or alternative R packages seem to run overly slowly. As this seems to be somewhat of a personal project, this might be sufficient for your use.
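For example (the file name below is just a placeholder):
# Dump the full statement so it can be opened from, or pasted into, a database GUI.
writeLines(statement, "insert_statement.sql")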

How to convert dataframe to xml using xml2 package?

I'm trying to update an xml file with new nodes using xml2. It's easy if I just write everything manually as text,
oldXML <- read_xml("<Root><Trial><Number>3.14159 </Number><Adjective>Fast </Adjective></Trial></Root>")
but I'm developing an application that will run calculations and then put those values into the XML, so I need a mix of character strings and variables. It ends up looking like:
var1 <- 4.567
var2 <- "Slow"
newLine <- read_xml(paste0("<Trial><Number>",var1," </Number><Adjective>",var2," </Adjective></Trial>"))
xml_add_child(oldXML,newLine)
I suspect there's a much less kludgy way to do this than using paste0, but I can't get anything else to work. I'd like to be able to just instruct it to update the xml by reference to the dataframe, such that it can create new trials:
<Trial>
<Number>df$number[1]</Number>
<Adjective>df$adjective[1]</Adjective>
</Trial>
<Trial>
<Number>df$number[2]</Number>
<Adjective>df$adjective[2]</Adjective>
</Trial>
Is there any way to create new Trial nodes in approximately that fashion, or at least more naturally than using paste0 to insert variables? Is this something the XML package does better than xml2?
If you have your new values in a data.frame like this:
vars <- data.frame(Number = c(4.567, 3.211),
                   Adjective = c("Slow", "Slow"),
                   stringsAsFactors = FALSE)
you can convert it to a list of xml_document objects as follows:
vars_xml <- lapply(purrr::transpose(vars),
                   function(x) {
                     as_xml_document(list(Trial = lapply(x, as.list)))
                   })
Then you can add the new nodes to the original xml:
for(trial in vars_xml) xml_add_child(oldXML, trial)
I don't know that this is better than your paste approach. Either way, you can wrap it in a function so you only have to write the ugly code once.
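For instance, a small wrapper along these lines (add_trials is just an illustrative name, not an existing function) hides the conversion entirely:
# Adds one <Trial> child per row of df; the column names become the child node names.
add_trials <- function(doc, df) {
  for (row in purrr::transpose(df)) {
    node <- xml2::as_xml_document(list(Trial = lapply(row, as.list)))
    xml2::xml_add_child(doc, node)
  }
  invisible(doc)
}

add_trials(oldXML, vars)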
Here's a solution that builds on @Ista's excellent answer. Basically, I've dropped the first lapply in favor of purrr::map (we could probably replace the second lapply with a map, but I couldn't find a more readable way to accomplish that).
library(purrr)
library(xml2)

vars_xml <- transpose(vars) %>%
  map(~ as_xml_document(list(Trial = lapply(.x, as.list))))

Creating a User-Item Matrix for Collaborative Filtering

I am attempting to run a Collaborative Filtering (CF) algorithm on "User-Item-Rating" data. My data is in long format, i.e. each row holds data for one user rating one specific item. I need to convert this into a "User-Item" matrix before I can apply a CF algorithm to it.
I am using the spread function from the tidyr package for this task. But given that I have more than 50k unique items, the resulting data frame would be huge. R is unable to execute this (on my local machine) and throws the "cannot allocate vector of size" error.
What's the best way to deal with this? Here are some options I tried exploring but was unable to get to work:
I was wondering if there is a way to return the output of the spread call as a sparse matrix.
I also explored whether packages that implement CF, such as recommenderlab, have an option to deal with this, but I could not find one.
Any help will be greatly appreciated.
Thanks!
As you (probably) have sparse data, go with a sparse matrix. Here's an example with 50,000 sparse example ratings:
library(stringi)
library(Matrix)

set.seed(1)
# stringsAsFactors = TRUE keeps item as a factor, which the integer codes
# and levels() below rely on (R >= 4.0 no longer does this by default).
df <- data.frame(item = stri_rand_strings(50000, 4), stringsAsFactors = TRUE)
df$user <- as.factor(1:nrow(df))
df$rating <- sample(1:10, nrow(df), TRUE)

m <- sparseMatrix(
  i = as.integer(df$user),
  j = as.integer(df$item),
  x = df$rating,
  dimnames = list(levels(df$user), levels(df$item))
)
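If the end goal is recommenderlab, the sparse matrix can then be wrapped as a rating matrix. A rough sketch, assuming your recommenderlab version supports coercing the dgCMatrix returned by sparseMatrix() to a realRatingMatrix, and with UBCF picked arbitrarily:
library(recommenderlab)

rrm <- as(m, "realRatingMatrix")          # wrap the sparse matrix as a rating matrix
rec <- Recommender(rrm, method = "UBCF")  # user-based collaborative filtering
pred <- predict(rec, rrm[1:5, ], n = 3)   # top-3 items for the first 5 users
as(pred, "list")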
