I'm currently working with the package *recommenderlab* and have run into memory issues because I work with a lot of data. The problem lies in creating the matrices, so I thought I could solve it by building several small matrices and merging them into one big matrix.
S1 <- S1 %>%
  select(SessionID, material_number) %>%
  mutate(value = 1) %>%
  spread(material_number, value, fill = 0) %>%
  select(-SessionID) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")
S2 <- S2 %>%
  select(SessionID, material_number) %>%
  mutate(value = 1) %>%
  # fill = 0 as for S1, so missing combinations become 0 rather than NA
  spread(material_number, value, fill = 0) %>%
  select(-SessionID) %>%
  as.matrix() %>%
  as("binaryRatingMatrix")
Now I want to somehow bind these two matrices. Is this possible, and do you have any ideas? I tried many different approaches and ran into many errors...
If you have any other creative ideas for fighting the memory issues, I look forward to discussing them with you :)
That is the link to the package/class: https://github.com/cran/recommenderlab/blob/master/R/binaryRatingMatrix.R
I tried to write and use functions that bind matrices together but ran into class issues I don't understand.
Error with rbind.fill.matrix(S1@data, S2@data): Error in as.vector(data) : no method for coercing this S4 class to a vector
I have some table data that has been scattered across around 1000 variables in a dataset. Most are split across 2 variables, and I can piece the data together using coalesce; however, this is pretty inefficient for some variables, which are instead spread across more than 10. Is there a better/more efficient way?
The syntax I have written so far is:
scattered_data <- df %>%
select(id, contains("MASS9A_E2")) %>%
#this brings in all the variables for this one question that start with this string
mutate(speciality = coalesce(MASS9A_E2_C4_1,MASS9A_E2_C4_2,MASS9A_E2_C4_3, MASS9A_E2_C4_4, MASS9A_E2_C4_5, MASS9A_E2_C4_6, MASS9A_E2_C4_7, MASS9A_E2_C4_8, MASS9A_E2_C4_9, MASS9A_E2_C5_1,MASS9A_E2_C5_2,MASS9A_E2_C5_3, MASS9A_E2_C5_4, MASS9A_E2_C5_5, MASS9A_E2_C5_6, MASS9A_E2_C5_7, MASS9A_E2_C5_8, MASS9A_E2_C5_9))
I have this for 28 MASS questions and would really love to be able to collapse these down a bit quicker.
You can use do.call() to take all columns except id as input of coalesce().
library(dplyr)
df %>%
  select(id, contains("MASS9A_E2")) %>%
  # `.` is the piped (already selected) data, so only the MASS9A_E2 columns are coalesced
  mutate(speciality = do.call(coalesce, select(., -id)))
In addition, you can call coalesce() iteratively by Reduce().
df %>%
  select(id, contains("MASS9A_E2")) %>%
  # `.` is the piped (already selected) data, so only the MASS9A_E2 columns are coalesced
  mutate(speciality = Reduce(coalesce, select(., -id)))
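As a quick check of the do.call() idea, here is a minimal sketch on made-up data. The data frame `toy` and its `Q_*` column names are hypothetical stand-ins for the real MASS9A_E2 variables, and the sketch assumes each row has a value in at most one of those columns.

```r
library(dplyr)

# Hypothetical toy data: each answer sits in exactly one of three columns.
toy <- tibble(
  id     = 1:3,
  Q_C4_1 = c("cardiology", NA, NA),
  Q_C4_2 = c(NA, "surgery", NA),
  Q_C5_1 = c(NA, NA, "oncology")
)

# A data frame is a list of columns, so do.call() passes every selected
# column to coalesce() as a separate argument.
collapsed <- toy %>%
  mutate(speciality = do.call(coalesce, select(., starts_with("Q_"))))

collapsed$speciality
```

Because the columns are picked with a selection helper, the same line should carry over to the other MASS questions by changing only the prefix.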
I'm working with a data frame and looking to calculate the mean age at which players debut in baseball.
I can get the answer, however I am a bit confused why I get different outputs doing the same things 2 ways.
Firstly, when I run the below code I get the correct mean:
mean(as.numeric(players$debut_age)/365, na.rm=TRUE)
But when I reorganize this as a pipe, it instead only prints the vector of days in debut_age:
players$debut_age %>% as.numeric()/365 %>% mean(na.rm=TRUE)
I'm sure there is something simple I'm missing, but I would like to know why these don't produce the same result.
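The %>% operator binds more tightly than /, so the piped version is parsed as (players$debut_age %>% as.numeric()) / (365 %>% mean(na.rm=TRUE)), which returns the whole vector divided by 365. To keep the division inside the pipeline, we can use divide_by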
library(dplyr)
players$debut_age %>%
as.numeric() %>%
magrittr::divide_by(365) %>%
mean(na.rm = TRUE)
Or place the as.numeric with / inside a block of {}, passing the piped value explicitly as .
players$debut_age %>%
  {as.numeric(.)/365} %>%
  mean(na.rm=TRUE)
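To see the precedence issue in isolation, here is a small sketch on made-up values. The vector `debut_days` is a hypothetical stand-in for players$debut_age.

```r
library(magrittr)

# Hypothetical debut ages in days, stored as character with one missing value.
debut_days <- c("7300", "8030", NA)

# The pipe from the question, unchanged.
original <- debut_days %>% as.numeric()/365 %>% mean(na.rm = TRUE)

# The same expression with the implicit grouping written out:
# %>% is applied before /, so mean() only ever sees the constant 365.
parsed_as <- (debut_days %>% as.numeric()) / (365 %>% mean(na.rm = TRUE))

# Grouping the division first gives the intended mean in years.
mean_years <- (as.numeric(debut_days) / 365) %>% mean(na.rm = TRUE)
```

Here `original` and `parsed_as` are the same length-3 vector, while `mean_years` is a single number.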
I have the following df where df <- data.frame(V1=c(0,0,1),V2=c(0,0,2),V3=c(-2,0,2))
If I do filter(df,rowSums!=0) I get the following error:
Error in filter_impl(.data, quo) :
Evaluation error: comparison (6) is possible only for atomic and list types.
Does anybody know why is that?
Thanks for your help
PS: Plain rowSums(df)!=0 works just fine and gives me the expected logical vector
A more tidyverse-style approach to the problem is to make your data tidy, i.e., with only one data value per row.
Sample data
library(dplyr)
library(tidyr)
my_mat <- matrix(sample(c(1, 0), replace=TRUE, 60), nrow=30) %>% as.data.frame
Tidy data and form implicit row sums using group_by
my_mat %>%
mutate(row = row_number()) %>%
gather(col, val, -row) %>%
group_by(row) %>%
filter(sum(val) == 0)
This tidy approach is not always as fast as base R, and it isn't always appropriate for all data types.
OK, I got it.
filter(df,rowSums(df)!=0)
Not the most difficult one...
Thanks.
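For anyone landing here with the same error: in filter(df, rowSums != 0) the name rowSums refers to the function itself, and comparing a function with 0 is what fails. A minimal sketch using the df from the question (the last line is an extra suggestion, not part of the original fix):

```r
library(dplyr)

df <- data.frame(V1 = c(0, 0, 1), V2 = c(0, 0, 2), V3 = c(-2, 0, 2))

# Without parentheses, rowSums is just the function object.
is.function(rowSums)

# Calling it on the data gives one sum per row, which filter() can use.
kept <- filter(df, rowSums(df) != 0)

# Equivalent pipe form: `.` is the data frame being filtered.
kept_piped <- df %>% filter(rowSums(.) != 0)

# If the goal is "drop rows that are all zero", counting non-zero cells
# avoids dropping rows whose values merely cancel out (e.g. -1 and 1).
kept_nonzero <- df %>% filter(rowSums(. != 0) > 0)
```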
I am new to multidplyr. I have a dataset similar to what this creates:
library(multidplyr)
library(tidyverse)
library(nycflights13)
f<-flights %>% group_by(month) %>% nest()
Now I'd like to do operations on each of these tibbles on different nodes.
cluster <- create_cluster(12)
f2<-partition(f,month,cluster=cluster)
Everything seems OK up to here, but when I do:
models<-f2 %>%
do(mod=lm(arr_delay~dep_delay,data=.))
I get the following error msg:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
12 nodes produced errors; first error: object 'arr_delay' not found
Now if I try
f2 %>% browser(.)
and then try .$ I do not have access to any of the columns.
Any ideas how these columns can be accessed?
This question has two parts:
1. Why are you getting an error using do?
The "proper" way to apply functions to a nested column (or "list column") is not to use do, but to use map instead. In this case, multidplyr isn't really important, since the normal dplyr code gives the same error.
f <- flights %>% group_by(month) %>% nest()
models <- f %>%
do(mod = lm(arr_delay ~ dep_delay, data = .))
Error in eval(expr, envir, enclos) : object 'arr_delay' not found
Using map from purrr on the other hand works fine.
models <- f %>%
mutate(model = purrr::map(data, ~ lm(arr_delay ~ dep_delay, data = .)))
Using your multidplyr code with mutate and map also works just fine.
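The reason do fails is that after nest() the grouped tibble only has the grouping column and a list column called data, so arr_delay is not a column that lm can see. A minimal sketch of the map pattern follows, using the built-in mtcars data instead of flights so it runs without extra packages; the grouping by cyl and the formula mpg ~ wt are illustrative assumptions, not part of the original question.

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per group; the original columns now live inside the `data` list column.
nested <- mtcars %>%
  group_by(cyl) %>%
  nest()

names(nested)  # the grouping column plus "data"; mpg and wt are not top-level columns

# map() walks over the list column and hands each inner data frame to lm().
models <- nested %>%
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))
```

Each element of models$model is an ordinary lm object, so functions such as coef() or summary() can be applied to it with another map call.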
2. How can I view the data in a party_df?
You can't easily do that. Remember they are not available in your current R session, but on the nodes. You can access the names using this little utility function:
names.party_df <- function(x) {
fun <- function(x) names(eval(x))
multidplyr::cluster_call(x$cluster, fun, as.name(x$name))[[1]]
}
But to access the full data, you'll most likely need to collect your data again. Alternatively, in RStudio one can use View, but note that this doesn't work great on large data sets.
This starts as an aesthetic question but then turns into a functional one, specifically about magrittr.
I want to add a manually entered data_frame to an existing one, like so:
cars_0 <- mtcars %>%
mutate(brand = row.names(.)) %>%
select(brand, mpg, cyl)
new_cars <- matrix(ncol = 3, byrow = T, c(
"VW Beetle", 25, 4,
"Peugeot 406", 42, 6)) # Coercing types is not an issue here.
cars_1 <- rbind(cars_0,
set_colnames(new_cars, names(cars_0)))
I'm writing the new cars in a matrix for "increased legibility", and therefore need to set the column names for it to be bound to cars_0.
If anyone likes magrittr as much as I do, they might want to present new_cars first and pipe it to set_colnames
cars_1 <- rbind(cars_0, new_cars %>%
set_colnames(names(cars_0)))
Or to avoid repetition they'll want to indicate cars_0 and pipe it to rbind
cars_1 <- cars_0 %>%
rbind(., set_colnames(new_cars, names(.)))
However, one cannot do both, as it becomes ambiguous which object is being piped
cars_1 <- cars_0 %>%
rbind(., new_cars %>% set_colnames(names(.)))
## Error in match.names(clabs, names(xi)) :
## names do not match previous names
My question: Is there a way to distinguish the two arguments that are piped?
Short answer: no.
Longer answer: I'm not sure what the rationale for doing this would be. The philosophy behind magrittr was to unnest composite functions, with the primary intent of making it easier to read the code. For example:
f(g(h(x)))
becomes
h(x) %>% g() %>% f()
Trying to use pipes in a manner that places two objects to be interpreted as the . argument goes against the philosophy of simplification. There are circumstances in which you can have nested pipes, but the environments ought to remain distinct. Trying to cross two pipes in the same environment can be likened to crossing the streams.
Don't cross the streams :)
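To illustrate why the combined version fails: each pipe binds . to its own left-hand side, so inside the inner pipe names(.) refers to the matrix, not to the data frame. The sketch below uses small made-up objects (`outer_df` and `new_rows` are hypothetical stand-ins for cars_0 and new_cars). The last step is not a way to pipe two arguments; it simply gives the outer value an explicit name so that only one . is in play.

```r
library(magrittr)

# Hypothetical stand-ins for cars_0 and new_cars.
outer_df <- data.frame(brand = "Mazda RX4", mpg = 21, cyl = 6)
new_rows <- matrix(c("VW Beetle", 25, 4), ncol = 3, byrow = TRUE)

# Inside the inner pipe, `.` is new_rows, so names(.) is NULL
# and no column names are set.
relabelled <- outer_df %>% { new_rows %>% set_colnames(names(.)) }
is.null(colnames(relabelled))

# Naming the outer value keeps the two objects distinct.
combined <- outer_df %>%
  (function(d) rbind(d, new_rows %>% set_colnames(names(d))))
```

Whether this reads better than the plain rbind(cars_0, set_colnames(new_cars, names(cars_0))) is a matter of taste.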