Error when filtering on rowSums using dplyr - r

I have the following df where df <- data.frame(V1=c(0,0,1),V2=c(0,0,2),V3=c(-2,0,2))
If I do filter(df,rowSums!=0) I get the following error:
Error in filter_impl(.data, quo) :
Evaluation error: comparison (6) is possible only for atomic and list types.
Does anybody know why that is?
Thanks for your help
PS: Plain rowSums(df)!=0 works just fine and gives me the expected logical vector.

A more tidyverse-style approach to the problem is to make your data tidy first, i.e., reshape it so that each row holds only one data value.
Sample data
library(tidyverse)
my_mat <- matrix(sample(c(1, 0), 60, replace = TRUE), nrow = 30) %>% as.data.frame()
Tidy data and form implicit row sums using group_by
my_mat %>%
mutate(row = row_number()) %>%
gather(col, val, -row) %>%
group_by(row) %>%
filter(sum(val) == 0)
This tidy approach is not always as fast as base R, and it isn't always appropriate for all data types.

OK, I got it.
filter(df,rowSums(df)!=0)
Not the most difficult one...
Thanks.
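For anyone hitting the same error: rowSums without parentheses is the function object itself, so rowSums != 0 compares a function with a number, which R refuses to do. A minimal sketch using the df from the question (the tryCatch wrapper and the names msg, keep and res are only there for illustration):

```r
library(dplyr)

df <- data.frame(V1 = c(0, 0, 1), V2 = c(0, 0, 2), V3 = c(-2, 0, 2))

# Comparing the function object itself with a number fails, even outside dplyr
msg <- tryCatch(rowSums != 0, error = function(e) conditionMessage(e))
msg

# Calling the function returns one sum per row, so the comparison is a logical vector
keep <- rowSums(df) != 0
keep

# That logical vector is what filter() needs
res <- filter(df, rowSums(df) != 0)
res
```

The exact wording of the message may differ between R versions, but the cause is the same: the function was never called.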

Related

How to bind Binary Rating Matrices with different columns?

I'm currently working with the package recommenderlab and have run into some memory issues because I work with a lot of data. The problem lies in the creation of the matrices, so I thought I could solve this by using a function that merges small matrices into one big matrix.
S1 <- S1 %>%
select(SessionID, material_number) %>%
mutate(value = 1) %>%
spread(material_number,value, fill = 0) %>%
select(-SessionID) %>%
as.matrix() %>%
as("binaryRatingMatrix")
S2 <- S2 %>%
select(SessionID, material_number) %>%
mutate(value = 1) %>%
spread(material_number,value) %>%
select(-SessionID) %>%
as.matrix() %>%
as("binaryRatingMatrix")
Now I want to somehow bind these 2 matrices. Is this possible, and do you have any ideas? I tried many different approaches and ran into many errors...
If you have any other creative ideas for fighting the memory issues, I look forward to discussing them with you :)
That is the link to the package/class: https://github.com/cran/recommenderlab/blob/master/R/binaryRatingMatrix.R
I tried to write and use functions that bind the matrices together but ran into class issues I don't understand.
Error with rbind.fill.matrix(S1@data, S2@data): Error in as.vector(data) : no method for coercing this S4 class to a vector

How to remove duplicate rows in R within Arrow?

I work with an Arrow dataset to reduce RAM usage, but I ran into the following problem.
I need to remove duplicate rows. With dplyr I can do it using distinct(), but this function isn't supported in Arrow.
Any ideas?
Following the recommendations, I wrote the following code
Sales_2021 <- Sales_2021 %>%
group_by(`Cust-Item-Loc`) %>%
arrange(desc(SBINDT)) %>%
distinct(`Cust-Item-Loc`, .keep_all = TRUE) %>%
collect()
and got the error message
Error: `distinct()` with `.keep_all = TRUE` not supported in Arrow
How can I slice the first rows?
The advice to use filter(!duplicated()) does not work either.
Sales_2021 <- Sales_2021 %>%
group_by(`Cust-Item-Loc`) %>%
arrange(desc(SBINDT)) %>%
filter(!duplicated(`Cust-Item-Loc`)) %>%
collect()
Error message
Error: Filter expression not supported for Arrow Datasets: !duplicated(`Cust-Item-Loc`)
Call collect() first to pull data into R.

Is it possible to count by using the count function within across()?

Hello R and tidyverse wizards,
I am trying to count the rows of the starwars data set to know how many observations we get with the variables "height" and "mass".
I managed to get it with this code:
library(tidyverse)
starwars %>%
select(height, mass) %>%
drop_na() %>%
summarise(across(.cols = c(height, mass),
list(obs = ~ n(),
mean = mean,
sd = sd))) %>%
View()
I would like to replace the obs = ~ n() by the count function and tried this version:
library(tidyverse)
starwars %>%
select(height, mass) %>%
drop_na() %>%
summarise(across(.cols = c(height, mass),
list(obs = count,
mean = mean,
sd = sd))) %>%
View()
but it was too simple to work, classic :p
I had this error message --> Error in View : Problem while computing ..1 = across(...)
And when I got rid of the View() function, I had another error message --> Error in summarise():
! Problem while computing ..1 = across(...).
Caused by error in across():
! Problem while computing column height_obs.
Caused by error in UseMethod():
! no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
So, I have two questions:
Could someone please explain why the code worked with ~ n() but not with count?
Is it possible to use the count function instead of ~ n() in that case?
Sorry if it is a dumb question, but I am just trying to understand the across and count functions by playing with them.
In the function description it says that "df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())", so I assume that using count() within across() results in something like a nested summarise() call, which is why n() is used instead.
Edit: Here you find the solution in the comment by G. Grothendieck
What is the difference between n() and count() in R? When should one favour the use of either or both?
n() returns a number
count() returns a dataframe
count() takes a dataframe as its first argument. It then returns counts for columns within that dataframe, passed as additional arguments. e.g.,
library(dplyr)
count(starwars, mass, height)
When you put count() inside across(), it passes each column to count() on its own, without including the dataframe as the first argument. Equivalent to if you ran,
count(starwars$height)
count(starwars$mass)
Because count() expects a dataframe as the first argument, it throws an error.
n(), on the other hand, doesn’t take any arguments, and simply counts the rows in the current group (or in the whole dataframe if it is ungrouped). You have to include the ~, as otherwise across() will try passing each column to n(), which causes an error since n() doesn’t expect arguments.
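To make the difference concrete, here is a minimal sketch on a small made-up tibble (toy is a hypothetical stand-in for the two starwars columns). It also shows length as one possible drop-in: unlike count(), it accepts a plain vector, so it can be passed to across() without the ~.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: only rows 1 and 4 are complete
toy <- tibble(height = c(170, 180, NA, 165),
              mass   = c(70, NA, 80, 60))

res <- toy %>%
  drop_na() %>%
  summarise(across(c(height, mass),
                   list(obs_n   = ~ n(),   # lambda ignores the column, counts rows
                        obs_len = length,  # receives the column as a vector
                        mean    = mean)))
res
```

Both obs_n and obs_len give the number of rows left after drop_na(); they would only differ if the column were something other than a plain vector.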

How to use group_by() with rep_len() r

Let me know if I need a dummy example for this but essentially I have a df of subgroups, each subgroup a different length (typically 30-35k values). I'd like to bind in a vector with partial vector recycling of c(1:200). From this question I figure I can use rep_len() to get around the dataframe's anti-partial-recycling. The problem is, I can't define length.out in rep_len(), as length.out changes with each subgroup. Any help would be appreciated. I tried doing this:
df_new <- df %>%
group_by(subgroup) %>%
mutate(newcol <- rep_len(1:200, length.out=.))
Which threw an invalid length.out error. I also tried
df_new <- df %>%
group_by(subgroup) %>%
mutate(newcol <- rep_len(1:200, length.out=nrow(.)))
But this throws an error that length.out is the length of my entire df, not the previous subgroup. Any help would be appreciated!
The dplyr package has a function n() that returns the number of rows in the current group, which is what length.out needs here.
mtcars %>%
group_by(cyl) %>%
mutate(newcol = rep_len(1:200, length.out=n()))
Also, in the mutate statement it should be "=" and not "<-".
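As a small check of the partial recycling, here is a sketch with a made-up data frame (the subgroup values and the short 1:2 pattern are assumptions chosen so the result is easy to read). Inside a grouped mutate(), n() is evaluated per group, whereas . and nrow(.) refer to the whole piped data frame.

```r
library(dplyr)

# Hypothetical data: subgroup "a" has 5 rows, subgroup "b" has 3
df <- data.frame(subgroup = c("a", "a", "a", "a", "a", "b", "b", "b"))

df_new <- df %>%
  group_by(subgroup) %>%
  mutate(newcol = rep_len(1:2, length.out = n())) %>%  # pattern restarts in each group
  ungroup()

df_new$newcol
```

The pattern is cut off at the end of each subgroup and starts again at 1 in the next one, which is the partial recycling the question asks for.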

Applying map function to a nested tibble in R

I'm trying to replicate an 'old' R script I found for the tidyverse package.
library(tidyverse)
library(dslabs)
DataTib<-as_tibble(us_contagious_diseases)
DataTib_nested <- DataTib %>%
group_by(disease) %>%
nest()
Mean_count_nested <- DataTib_nested %>%
mutate(mean_count = map(.x=DataTib_nested$data, ~mean(.x$count)))
As I understand, I have a tibble where data was grouped by disease and the remaining variables/data were nested, and then I'm trying to add a new column which should represent the average for variable "count" on that nested dataframe.
But I get the error, which I don't quite understand:
Error: Problem with `mutate()` input `mean_count`.
x Input `mean_count` can't be recycled to size 1.
i Input `mean_count` is `map(.x = DataTib_nested$data, ~mean(.x$count))`.
i Input `mean_count` must be size 1, not 7.
i The error occured in group 1: disease = "Hepatitis A".
Thanks in advance and best regards!
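Your syntax is slightly wrong. Inside mutate(), refer to the column as data rather than DataTib_nested$data: the tibble is still grouped by disease, so each group has a single row, while DataTib_nested$data is the full list of 7 elements and cannot be recycled to size 1.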
DataTib_nested <- DataTib %>%
group_by(disease) %>%
nest(data = - disease)
Mean_count_nested <- DataTib_nested %>%
mutate(mean_count = map_dbl(data, ~mean(.x$count)))
Note that I use map_dbl instead of map since the return value is numeric.
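The difference between map and map_dbl can be seen on a small made-up tibble (toy and its values are hypothetical, standing in for us_contagious_diseases). This is only a sketch: map() returns a list column, while map_dbl() returns a plain numeric column.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical data with two "diseases"
toy <- tibble(disease = c("A", "A", "B", "B", "B"),
              count   = c(1, 3, 10, 20, 30))

# One row per disease, remaining columns packed into the list column `data`
nested <- toy %>% nest(data = -disease)

res <- nested %>%
  mutate(mean_list = map(data, ~ mean(.x$count)),      # list column
         mean_dbl  = map_dbl(data, ~ mean(.x$count)))  # numeric column

res$mean_dbl
```

Referring to data by its bare name means mutate() only sees the rows of the current tibble (or group), so the lengths always match.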
