listcolumns and multidplyr - r

I am new to multidplyr. I have a dataset similar to what this creates:
library(multidplyr)
library(tidyverse)
library(nycflights13)
f<-flights %>% group_by(month) %>% nest()
Now I'd like to run operations on each of these tibbles on different nodes.
cluster <- create_cluster(12)
f2<-partition(f,month,cluster=cluster)
Everything seems OK up to here, but when I run:
models <- f2 %>%
  do(mod = lm(arr_delay ~ dep_delay, data = .))
I get the following error msg:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
12 nodes produced errors; first error: object 'arr_delay' not found
Now if I try
f2 %>% browser(.)
and then type .$, I do not have access to any of the columns.
Any ideas how these columns can be accessed?

This question has two parts:
1. Why are you getting an error using do?
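The "proper" way to apply functions to a nested column (or "list column") is not to use do, but to use map instead. After nest(), each row of f holds a month and a list column named data, so inside do() the pronoun . refers to a tibble that has no arr_delay column; the flight columns live inside data. In this case, multidplyr isn't really important, since the normal dplyr code gives the same error.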
f <- flights %>% group_by(month) %>% nest()
models <- f %>%
  do(mod = lm(arr_delay ~ dep_delay, data = .))
Error in eval(expr, envir, enclos) : object 'arr_delay' not found
Using map from purrr on the other hand works fine.
models <- f %>%
  mutate(model = purrr::map(data, ~ lm(arr_delay ~ dep_delay, data = .)))
Using your multidplyr code with mutate and map also works just fine.
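For reference, a minimal sketch of that combination follows. It assumes the older development API of multidplyr used in the question (create_cluster() and partition() with a cluster argument); newer releases create clusters differently, so treat the setup lines as an assumption rather than the package's current interface. The purrr:: prefix is used so that map is found on the worker nodes.

```r
library(multidplyr)
library(tidyverse)
library(nycflights13)

# One row per month; the flight columns move into the list column `data`.
f <- flights %>% group_by(month) %>% nest()

# Assumption: older multidplyr API, as in the question.
cluster <- create_cluster(12)
f2 <- partition(f, month, cluster = cluster)

# Fit one model per nested tibble on the nodes, then bring the results back.
models <- f2 %>%
  mutate(model = purrr::map(data, ~ lm(arr_delay ~ dep_delay, data = .))) %>%
  collect()
```

After collect(), models is an ordinary tibble in the local session, so models$model[[1]] can be inspected like any other lm object.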
2. How can I view the data in a party_df?
You can't easily do that. Remember that the data are not available in your current R session, but on the nodes. You can access the column names using this little utility function:
names.party_df <- function(x) {
  fun <- function(x) names(eval(x))
  multidplyr::cluster_call(x$cluster, fun, as.name(x$name))[[1]]
}
But to access the full data, you'll most likely need to collect your data again. Alternatively, in RStudio one can use View, but note that this doesn't work great on large data sets.

Related

How to bind Binary Rating Matrices with different columns?

I'm currently working with the package recommenderlab and have run into some memory issues because I work with a lot of data. The problem lies in the creation of the matrices, so I thought I could solve this by using a function that merges small matrices together into one big matrix.
S1 <- S1 %>%
  select(SessionID, material_number) %>%
  mutate(value = 1) %>%
  spread(material_number, value, fill = 0) %>%
  select(-SessionID) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")
S2 <- S2 %>%
  select(SessionID, material_number) %>%
  mutate(value = 1) %>%
  spread(material_number, value, fill = 0) %>%
  select(-SessionID) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")
Now I want to somehow bind these 2 matrices. Is this possible, and do you have any ideas? I tried many different approaches and ran into many errors...
If you have any other creative ideas to fight the memory issues, I look forward to discussing them with you :)
This is the link to the package/class: https://github.com/cran/recommenderlab/blob/master/R/binaryRatingMatrix.R
I tried to write and use functions that bind matrices together but ran into class issues I don't understand.
Error with rbind.fill.matrix(S1@data, S2@data): Error in as.vector(data) : no method for coercing this S4 class to a vector

Mean from multiple Columns (Error Message)

I'm still fairly new to R and have been practicing a bit lately.
I have the following (simplified) Data Set:
So it's basically a questionnaire asking random people how much they prefer each of these cities, on a scale from 1 to 7.
I would like to find out which city has the highest average preference.
So what I first did was: mean(dataset[, 3], na.rm=TRUE) to find out the average preference for Prague. That worked!
Now I wanted to create a table which shows me the mean for each city.
My thought was: table(mean(dataset[3:8], na.rm=TRUE))
However, all I get is the following message:
In mean.default(umfrage[37:38], na.rm = TRUE) :
argument is not numeric or logical: returning NA
Does someone know what that means and how I could achieve the result?
I figured it out.
I simply used this function: lapply(dataset[3:8], mean, na.rm = TRUE)
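The reason this works is that mean() expects a numeric vector, while dataset[3:8] is a data frame, so the function has to be applied column by column. A small sketch with made-up data (the column names and values below are hypothetical, not the asker's survey) shows the idea, along with colMeans() as an alternative for all-numeric columns:

```r
# Hypothetical survey data; names and values are illustrative only.
dataset <- data.frame(
  id = 1:4,
  age = c(23, 35, 41, 29),
  Pref_Prague = c(7, 5, NA, 6),
  Pref_London = c(4, 6, 5, NA)
)

# Apply mean() to each preference column; sapply() returns a named vector.
city_means <- sapply(dataset[3:4], mean, na.rm = TRUE)

# colMeans() computes the same per-column means in one call.
city_means_alt <- colMeans(dataset[3:4], na.rm = TRUE)

# Name of the column with the highest average preference.
best_city <- names(which.max(city_means))
```

Compared with lapply(), sapply() simplifies the result to a named numeric vector, which is convenient for which.max() or for sorting.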
You could also use the dplyr and tidyr packages (both are part of the tidyverse):
library(tidyverse)
result <- dataset %>%
  gather("city", "value", Pref_Prague:Pref_London) %>%
  group_by(city) %>%
  summarise(mean = mean(value, na.rm = TRUE))

Fatal error with min_rank inside mutate with a data frame

I am trying to practice window functions and just want to experiment with this code but my R session aborts when I run this code:
library(dplyr)
mtcars %>%
  select(mpg, cyl) %>%
  mutate(r = min_rank())
If these functions behave differently depending on whether the context is a data frame or a vector, how should we know what they do? All the examples are for vectors... E.g. row_number() behaves differently on a data frame compared to a vector.
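One thing worth noting about the snippet above: min_rank() is documented as taking the vector to rank as its argument, and inside mutate() a column name stands for that vector. A minimal sketch, assuming the intent was to rank the cars by mpg:

```r
library(dplyr)

# Inside mutate(), `mpg` is the column vector, so min_rank() receives a
# vector exactly as in the vector-based documentation examples.
ranked <- mtcars %>%
  select(mpg, cyl) %>%
  mutate(r = min_rank(mpg))

# For a ranking within each cylinder group, group first.
ranked_by_cyl <- mtcars %>%
  select(mpg, cyl) %>%
  group_by(cyl) %>%
  mutate(r = min_rank(mpg)) %>%
  ungroup()
```

In both cases the function itself behaves the same as on a plain vector; only the vector it receives changes (the whole column, or the column restricted to one group).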

How to filter on partial match using sparklyr

I'm new to sparklyr (but familiar with Spark and PySpark), and I've got a really basic question. I'm trying to filter a column based on a partial match. In dplyr, I'd write my operation like so:
businesses %>%
  filter(grepl('test', biz_name)) %>%
  head
Running that code on a spark dataframe however gives me:
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GREPL'. This function is neither a registered temporary function nor a permanent function registered in the database 'project_eftpos_failure'.; line 5 pos 7
As in standard Spark, you can use either rlike (Java regular expressions):
df <- copy_to(sc, iris)
df %>% filter(rlike(Species, "osa"))
# or anchored
df %>% filter(rlike(Species, "^.*osa.*$"))
or like (simple SQL wildcard patterns):
df %>% filter(like(Species, "%osa%"))
Both methods can also be used with infix notation as
df %>% filter(Species %rlike% "^.*osa.*$")
and
df %>% filter(Species %like% "%osa%")
respectively.
For details see vignette("sql-translation").

Using dplyr to compare models

I was working through the examples in the dplyr documentation of the do() function and all was well until I came across this snippet to summarize model comparisons:
# compare %>% summarise(p.value = aov$`Pr(>F)`)
The error was "Error: expecting a single value". So I found a way forward by accessing the list of aov elements directly. This question is about subsetting operators, and whether there is a better way to do this. Here is my full attempt and solution.
models <- group_by(mtcars, cyl) %>%
  do(mod_lin = lm(mpg ~ disp, data = .),
     mod_quad = lm(mpg ~ poly(disp, 2), data = .))
compare <- models %>% do(aov = anova(.$mod_lin, .$mod_quad))
compare %>% summarise(p.value = aov$'Pr(>F)')
Error: expecting a single value
Looking into the structure of compare:
# select comparison 1
compare$aov[[1]]
# select comparison 1 and all of element 6 (the p-values)
compare$aov[[1]][6]
# just the p-values
compare$aov[[1]][2,6]
compare %>% summarise(pvalue = aov[2,6]) # this gets the p-values by group
So I suppose I'm wondering how, for an object of classes 'rowwise_df', 'tbl_df' and 'data.frame', summarise can intuit the [[]] operator. And also whether there might be a better way to do this.
You could try
compare %>% do(.$aov['Pr(>F)']) %>% na.omit()
