dplyr "where in" another table - R

I am looking for dplyr's equivalent of the SQL WHERE IN clause.
My goal is to filter rows based on their presence in a column of another table. The code I currently have returns an error. I am guessing this is due to incorrectly pulling the data out of the second table to compare with the first.
Instruments %>% filter(name %in% distinct(Musician$plays_instrument))
The example above is similar to what I've got currently. I am guessing my mistake can be seen in the syntax I am using. If not, I can provide a working example, but that takes some time to build and I was hoping to get this solved more quickly.

You should probably use unique(), since distinct() requires a data frame as its first argument.
library(dplyr)
Instruments %>% filter(name %in% unique(Musician$plays_instrument))

We can use subset in base R
subset(Instruments, name %in% unique(Musician$plays_instrument))
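If you prefer to avoid pulling the column out with $, dplyr's semi_join() is the closest analogue to SQL's WHERE IN: it keeps the rows of the first table that have a match in the second. A minimal sketch, assuming the column names from the question (name in Instruments, plays_instrument in Musician):
library(dplyr)
# keep Instruments rows whose name appears in Musician$plays_instrument
Instruments %>% semi_join(Musician, by = c("name" = "plays_instrument"))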


Working with a vector of variables in dplyr. Is there an "any_of" equivalent in dplyr for contexts other than selecting?

This is about doing mutations and simple transformations that don't depend on variables whose names can be matched via "starts_with", "ends_with", etc., but on variables whose names are stored in a character vector.
Is there a way, for example, to do transformations on the variables stored in such a vector? I bet there is, but despite my efforts and consulting the list of dplyr functions I can't find it.
So far I have tried the usual selectors like any_of() and all_of(), but they keep failing because I'm not using them in a selecting context.
So I'm stuck with this, which doesn't work:
myvars = c("Sepal.Length","Sepal.Width")
iris2 = iris %>%
  rowwise() %>%
  mutate(geo.mean = psych::geometric.mean(any_of(myvars)))
# It should do something like this:
iris %>%
  rowwise() %>%
  mutate(geo.mean = psych::geometric.mean(c(Sepal.Length, Sepal.Width)))
I have also taken a look at pick(), but I'm currently trying to reinstall dplyr to update from my 1.0.1 to the 1.1.0 version, so far to no avail. (It keeps telling me to restart the session despite there being no loaded packages that could prevent the update; I'm still working on it.) The thing is, from the documentation it looks like a subset, and I intend to keep all my variables.
Yes, you can do this using the c_across() function, which lets you use tidy selection syntax:
iris %>%
  rowwise() %>%
  mutate(geo.mean = psych::geometric.mean(c_across(all_of(myvars))))
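For completeness, a self-contained version of the same idea, with myvars defined as in the question (this assumes the psych package is installed):
library(dplyr)
myvars <- c("Sepal.Length", "Sepal.Width")
iris2 <- iris %>%
  rowwise() %>%
  mutate(geo.mean = psych::geometric.mean(c_across(all_of(myvars)))) %>%
  ungroup()   # drop the rowwise grouping; all original columns are kept
head(iris2)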

Sample data after using filter or select from sparkly

I have a large dataframe to analyse, so I'm using sparklyr to manage it in a fast way. My goal is to take a sample of the data, but before that I need to select some variables of interest and filter certain values in some columns.
I tried to select and/or filter the data and then use the function sample_n, but it always gives me this error:
Error in vapply(dots(...), escape_expr, character(1)) : values must
be length 1, but FUN(X[[2]]) result is length 8
Below is an example of the behaviour:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = 'local')
data_example <- copy_to(sc, iris, 'iris')
data_select <- select(data_example, Sepal_Length, Sepal_Width, Petal_Length)
data_sample <- sample_n(data_select, 25)
data_sample
I don't know if I'm doing something wrong, since I started using this package a few days ago, but I could not find any solution to this problem. Any help will be appreciated!
It seems to be a problem with the type of object returned when you select/mutate/filter the data.
So I managed to get around it by materialising the data in Spark with the compute() command and then sampling it.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = 'local')
data_example <- copy_to(sc, iris, 'iris')
data_select <- data_example %>%
  select(Sepal_Length, Sepal_Width, Petal_Length) %>%
  compute('data_select')
data_sample <- sample_n(data_select, 25)
data_sample
Unfortunately, this approach takes a long time to run and consumes a lot of memory, so I expect someday I'll find a better solution.
I also got the same issue earlier; then I tried the following:
data_sample = data_select %>% head(25)
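Another option to consider (my suggestion, not part of the answers above) is sparklyr's sdf_sample(), which samples a fraction of rows on the Spark side instead of asking for a fixed number of rows. This sketch assumes it accepts the lazily selected table; if not, add compute() first:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = 'local')
data_example <- copy_to(sc, iris, 'iris')
# roughly 20% of the rows, sampled inside Spark without replacement
data_sample <- data_example %>%
  select(Sepal_Length, Sepal_Width, Petal_Length) %>%
  sdf_sample(fraction = 0.2, replacement = FALSE, seed = 42)
data_sample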

R - fix sorting when using anti_join to remove stop words (creating ngrams)

I'm very new to R and coding, and I'm trying to do a frequency analysis on a long list of sentences and their given weighting. I've un-nested and mutated the data, but when I try to remove stop words, the sort order of words within each sentence gets randomized. I need to create bigrams later on, and would prefer that they're based on the original phrase.
Here's the relevant code; I can provide more if this is insufficient:
library(dplyr)
library(tidytext)
data = data %>%
  anti_join(stop_words) %>%
  filter(!is.na(word))
What can I do to retain the original sort order within each sentence? I have all the words in a sentence indexed so I can match them to their given weight. Is there a better way to remove stop words that doesn't mess up the sort order?
Saw a similar question here but it's unresolved: How to stop anti_join from reversing sort order in R?
Also tried this but didn't work: dplyr How to sort groups within sorted groups?
Got help from a colleague in writing this but unfortunately they're not available anymore so any insight will be helpful. Thanks!
You could add a row index to your data before the join, then restore the original order afterwards:
library(dplyr)
library(tidytext)
data = data %>%
  dplyr::mutate(idx = 1:n()) %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::filter(!is.na(word)) %>%
  dplyr::arrange(idx)
(the dplyr:: prefix is not necessary, but it helps you remember where the function comes from)
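A small self-contained illustration of the same idea, using a made-up sentence (the toy data is mine, not from the question):
library(dplyr)
library(tidytext)
toy <- tibble(word = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
toy %>%
  mutate(idx = row_number()) %>%
  anti_join(stop_words, by = "word") %>%
  arrange(idx)
# stop words such as "the" and "over" are dropped; the remaining words
# keep their original sentence order thanks to arrange(idx)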

Troubles when grouping a column data in R

Hi all.
I'm trying to extract a vector from a column that contains all the values that belong to a specific type.
To be more concrete, I have a table with 5000 rows. The column "Types" can take the values A, B, C or D. I need to extract all the rows that belong to a specific type, like A.
I want to use the function group_by from the dplyr package, and I'm trying:
dplyr::group_by(my_database,type)
or even
dplyr::group_by(my_database,type,A)
but I don't get what I need. Does anyone know how to proceed?
NOTE: I can't use the %>% function because for some strange reason R says "could not find %>% function".
I appreciate a lot your kind responses in advance.
R can't find %>% because you have not loaded a package that provides it. For example:
library(dplyr)
The function that you're looking for is filter(), not group_by().
my_database %>% filter(Types == "A")
Spend some time with the dplyr documentation (introduction) or a good tutorial.
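Since the question also mentions extracting a vector from a column, you can follow the filter with pull(); the column name below is just a placeholder for whichever column you actually want:
library(dplyr)
my_database %>%
  filter(Types == "A") %>%
  pull(value)   # "value" is a hypothetical column name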

Using dplyr::filter, how can the output be limited to just first 500 rows?

dplyr is a great and fast library.
Using the %>% operator enables powerful manipulation.
In my first step, I need to limit the output to only 500 rows max (for display purposes).
How can I do that?
par <- filter(pc, Child_Concept_GID == as.character(mcode)) %>% select(Parent_Concept_GID)
What I need is something like
filter(pc, CONDITION, rows = 500)
Is there a direct way, or a nice workaround, that doesn't make the first step a separate step (outside the %>% "stream")?
There are a couple of ways to do this, assuming you are pipe-lining your data (using %>%):
top_n(tn) works with grouped data. It will not return the first tn rows, even if the data is sorted with arrange().
head(500) takes the first 500 rows (it can be used after arrange(), for example).
sample_n(size = 500) can be used to select 500 random rows.
If you are looking for the R equivalent to SQL's LIMIT, use head().
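A quick sketch of head() as the LIMIT equivalent inside the pipe, reusing the names from the question (pc and mcode are assumed to be defined as in the original code):
library(dplyr)
par <- pc %>%
  filter(Child_Concept_GID == as.character(mcode)) %>%
  head(500) %>%
  select(Parent_Concept_GID)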
I think you're actually looking for slice() here.
filter(pc, condition) %>% slice(1:500)
This does not rank the results; it merely pulls a slice by position, in this case positions 1 through 500.
If this is coming from a relational db, head is a better option.
To limit the output of filter, one can pipe the result into top_n().
Credit goes to commenter joran.
Solution:
par <- filter(pc, Child_Concept_GID == as.character(mcode)) %>% top_n(500) %>% select(Parent_Concept_GID)
