How to remove duplicate rows in R within Arrow?

I work with an Arrow dataset to reduce RAM usage, but I ran into the following problem.
I need to remove duplicate rows. With dplyr I can do it using distinct(), but this function isn't supported in Arrow.
Any ideas?
Following the recommendations, I wrote the following code:
Sales_2021 <- Sales_2021 %>%
group_by(`Cust-Item-Loc`) %>%
arrange(desc(SBINDT)) %>%
distinct(`Cust-Item-Loc`, .keep_all = TRUE) %>%
collect()
and got the error message:
Error: `distinct()` with `.keep_all = TRUE` not supported in Arrow
How can I slice the first rows?
The advice to use filter(!duplicated()) does not work either.
Sales_2021 <- Sales_2021 %>%
group_by(`Cust-Item-Loc`) %>%
arrange(desc(SBINDT)) %>%
filter(!duplicated(`Cust-Item-Loc`)) %>%
collect()
Error message
Error: Filter expression not supported for Arrow Datasets: !duplicated(`Cust-Item-Loc`)
Call collect() first to pull data into R.
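One possible workaround, sketched under assumptions rather than offered as a confirmed answer: since distinct(.keep_all = TRUE) and duplicated() are not available on Arrow data, the latest SBINDT per `Cust-Item-Loc` can be computed with a grouped summarise() and joined back, so that only the reduced result is collected into R. The toy table, its Qty column, and all values are invented for illustration, and the sketch assumes an arrow version that supports grouped summarise() and joins.

```r
library(dplyr)
library(arrow)

# Hypothetical stand-in for Sales_2021; values are made up.
sales <- arrow_table(tibble(
  `Cust-Item-Loc` = c("A", "A", "B", "B", "C"),
  SBINDT          = c(20210101L, 20210315L, 20210210L, 20210105L, 20210401L),
  Qty             = c(1, 2, 3, 4, 5)
))

# Latest SBINDT per key, computed inside Arrow.
latest <- sales %>%
  group_by(`Cust-Item-Loc`) %>%
  summarise(SBINDT = max(SBINDT))

# Keep only the rows that carry the latest date, then pull the result into R.
sales_dedup <- sales %>%
  inner_join(latest, by = c("Cust-Item-Loc", "SBINDT")) %>%
  collect() %>%
  # Ties on the latest date are resolved in R on the already reduced data.
  distinct(`Cust-Item-Loc`, .keep_all = TRUE) %>%
  arrange(`Cust-Item-Loc`)
```

If several rows share the latest SBINDT for a key, which of them survives the final distinct() is arbitrary in this sketch.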

Related

How can I use group_by in text mining with R?

I am just wondering why I can't use group_by() on a corpus-text object.
I tried using some packages too but at the end nothing.
Also tried to convert to tibble.
My code:
data <- data %>%
group_by(Title) %>%
mutate(line = row_number()) %>%
ungroup()
The output:
Error:
! All columns in a tibble must be vectors.
✖ Column `text` is a `corpus_text` object.
Run `rlang::last_error()` to see where the error occurred.

googledrive unnest returns unused argument error

I am using googledrive to get information about a list of files on a shared drive, and would like to unnest(drive_resource) into columns for purpose of exploring the data.
When I do so, I receive an error. Appears to be something about the class of nested list I am trying to unnest as columns. Any suggestions?
library(dplyr)
library(tidyr)
library(googledrive)  # drive_find() comes from googledrive, not googlesheets
df <- drive_find(team_drive = "my_team_drive")
unnest(df, drive_resource)
Error in as_tibble.dribble(output, .name_repair = "minimal") :
unused argument (.name_repair = "minimal")
Turns out, there is a bug fix in dev for the .name_repair issue.
A quick-and-dirty solution appears below, improvements welcome.
df2 <- df %>%
map_dfr(
.x = .$drive_resource,
.f = ~ unlist(.x) %>% enframe() %>% spread(name, value)
) %>%
bind_cols(select(df, name:id))
If you only want the top level of that list object, the approach below is simpler. It is especially useful for team drives with many users, since the list of permissionIds otherwise gets turned into as many columns as there are users. Just call unnest_wider() for each of (parents, spaces, lastModifyingUser, capabilities, permissionIds, exportLinks, imageMediaMetadata, videoMediaMetadata) that you want to see information about.
The code below is courtesy of @JennyBryan.
df2 <- df %>%
select(drive_resource) %>%
unnest_wider(drive_resource)
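To see what unnest_wider() does without access to a shared drive, here is a minimal sketch on a hand-made tibble. The tibble only imitates the shape of a drive_find() result: the file names, ids, and the three fields inside drive_resource are assumptions for illustration, not real Drive metadata.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the dribble returned by drive_find();
# field names and values are illustrative only.
df <- tibble(
  name = c("report.csv", "notes.txt"),
  id   = c("id-1", "id-2"),
  drive_resource = list(
    list(kind = "drive#file", mimeType = "text/csv",   size = "120"),
    list(kind = "drive#file", mimeType = "text/plain", size = "45")
  )
)

# Each top-level element of the list column becomes its own column.
df2 <- df %>%
  select(drive_resource) %>%
  unnest_wider(drive_resource)
```

In this toy case df2 has one column per field (kind, mimeType, size) and one row per file.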

Group & Filter organization structure with dplyr

When I organize my analysis as follows:
group_by(transactions.DF,MonthCode) %>%
filter(str_detect(transactions.DF$Description,"Innocean")) %>%
summarize(monthly.income = sum(Amount))
I get the following error:
Error: Result must have length 84, not 3029
When I organize my analysis as follows:
transactions.DF %>%
filter(str_detect(transactions.DF$Description,"Innocean")) %>%
group_by(.$MonthCode) %>%
summarize(monthly.income = sum(Amount))
I get my results.
I thought filter would maintain my grouping structure and allow for an analysis.
The issue is that using transactions.DF$ inside filter() breaks the grouping: it takes the values from the entire column instead of the 'Description' values within each 'MonthCode' group.
library(dplyr)
library(stringr)
transactions.DF %>%
group_by(MonthCode) %>%
filter(str_detect(Description,"Innocean")) %>%
summarize(monthly.income = sum(Amount))
NOTE: The objectIdentifier$ prefix is not needed inside tidyverse functions. It can be useful in certain situations, such as extracting a column from another dataset for a comparison.
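A small self-contained sketch of the corrected pipeline, run on made-up data (the months, descriptions, and amounts below are assumptions for illustration). It shows that with bare column names, filter() works within the groups and summarize() returns one row per MonthCode.

```r
library(dplyr)
library(stringr)

# Hypothetical transactions; values are invented.
transactions.DF <- data.frame(
  MonthCode   = c("2020-01", "2020-01", "2020-02", "2020-02"),
  Description = c("Innocean payroll", "Grocery", "Innocean bonus", "Rent"),
  Amount      = c(100, -20, 50, -500)
)

monthly <- transactions.DF %>%
  group_by(MonthCode) %>%
  # Bare column name: evaluated per group, so lengths always match.
  filter(str_detect(Description, "Innocean")) %>%
  summarize(monthly.income = sum(Amount))
```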

Error when filtering on rowSums using dplyr

I have the following df where df <- data.frame(V1=c(0,0,1),V2=c(0,0,2),V3=c(-2,0,2))
If I do filter(df,rowSums!=0) I get the following error:
Error in filter_impl(.data, quo) :
Evaluation error: comparison (6) is possible only for atomic and list types.
Does anybody know why that is?
Thanks for your help.
PS: Plain rowSums(df)!=0 works just fine and gives me the expected logical vector.
A more tidyverse-style approach to the problem is to make your data tidy, i.e., with only one data value per row.
Sample data
my_mat <- matrix(sample(c(1, 0), 60, replace = TRUE), nrow = 30) %>% as.data.frame()
Tidy data and form implicit row sums using group_by
my_mat %>%
mutate(row = row_number()) %>%
gather(col, val, -row) %>%
group_by(row) %>%
filter(sum(val) == 0)
This tidy approach is not always as fast as base R, and it isn't always appropriate for all data types.
OK, I got it.
filter(df,rowSums(df)!=0)
Not the most difficult one...
Thanks.
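For completeness, a short sketch of why the first attempt failed, using the data frame from the question: a bare rowSums is the function object itself, so comparing it with 0 is not meaningful, whereas rowSums(df) is a numeric vector with one value per row.

```r
library(dplyr)

df <- data.frame(V1 = c(0, 0, 1), V2 = c(0, 0, 2), V3 = c(-2, 0, 2))

# A bare `rowSums` is a function, not a vector of sums.
bare_is_function <- is.function(rowSums)

# Calling it on the data gives one sum per row, which filter() can use.
sums <- rowSums(df)
kept <- filter(df, rowSums(df) != 0)
```

Here the second row sums to zero and is dropped; the first and third rows are kept.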

Subsetting on a tibble using "[]" gives "object not found" error

The article on dplyr here says "[]" (square brackets) can be used to subset filtered Tibbles like this:
filter(mammals, adult_body_mass_g > 1e7)[ , 3]
But I am getting an "object not found" error.
Here is the replication of the error on a more known dataset "iris"
library(dplyr)
iris %>% filter(Sepal.Length>6) [,c(1:3)]
Error in filter_(.data, .dots = lazyeval::lazy_dots(...)) :
object 'Sepal.Length' not found
I also want to mention that I am deliberately avoiding dplyr's native subsetting with select(), because for a single column I need a vector as output and not a data frame. Unfortunately, dplyr always forces a data frame output (for good reasons).
You need an extra pipe:
iris %>% filter(Sepal.Length>6) %>% .[,1:3]
Sorry, forgot the . before the brackets.
Note: Your code will probably be more readable if you stick to the tidyverse syntax and use select as the last operation.
iris %>%
filter(Sepal.Length > 6) %>%
select(1:3)
The dplyr-native way of doing this is to use select:
iris %>% filter(Sepal.Length > 6) %>% select(1:3)
You could also use {} so that the filtering is done before [ is applied:
{iris %>% filter(Sepal.Length>6)}[,c(1:3)]
Or, as suggested in another answer, use the . notation to indicate where the data should go in relation to [:
iris %>% filter(Sepal.Length>6) %>% .[,1:3]
You can also load magrittr explicitly and use extract, which is a "pipe-able" version of [:
library(magrittr)
iris %>% filter(Sepal.Length>6) %>% extract( ,1:3)
The blog entry you reference is old in dplyr time - about 3 years old. dplyr has been changing a lot. I don't know whether the blog's suggestion worked at the time it was written or not, but I'd recommend finding more recent sources to learn about this frequently changing package.
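Since the question asks for a vector rather than a data frame when a single column is selected, here is a minimal sketch using dplyr's pull(), with a base-R variant using [[ ]] next to it. The choice of Sepal.Width as the column is just an example.

```r
library(dplyr)

# pull() returns the column as a plain vector rather than a data frame.
sepal_width <- iris %>%
  filter(Sepal.Length > 6) %>%
  pull(Sepal.Width)

# Same result with the . placeholder and [[ ]].
sepal_width_base <- iris %>%
  filter(Sepal.Length > 6) %>%
  .[["Sepal.Width"]]
```

Both results are numeric vectors with one element per row that passed the filter.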
