Troubles when grouping column data in R

Hi all.
I'm trying to extract a vector from a column, containing all the values that belong to a specific type.
To be more precise: I have a table with 5000 rows, and the column "Types" can take the values A, B, C, or D. I need to extract all the rows that belong to a specific type, such as A.
I want to use the group_by function from the dplyr package, and I'm trying:
dplyr::group_by(my_database,type)
or even
dplyr::group_by(my_database,type,A)
but I don't get what I need. Does anyone know how to proceed?
NOTE: I can't use the %>% function because, for some strange reason, R says "could not find %>% function".
Thanks in advance for your kind responses.

R can't find %>% because you have not loaded a package that provides it; %>% comes from magrittr and is re-exported by dplyr. For example:
library(dplyr)
The function that you're looking for is filter.
my_database %>% filter(Types == "A")
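Since %>% was unavailable for you, note that the same call works without the pipe:
# equivalent call without %>%
filter(my_database, Types == "A")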
Spend some time with the dplyr documentation (introduction) or a good tutorial.

Related

How to change a dataframe's column types using tidy selection principles

I'm wondering what the best practices are for changing a dataframe's column types, ideally using tidy selection.
Ideally you would set the column types correctly up front, when you import the data, but that isn't always possible for various reasons.
So the next best pattern I could identify is the one below:
# random dataframe
library(tibble)
library(lubridate)  # needed for ymd()

df <- tibble(a_col = 1:10,
             b_col = letters[1:10],
             c_col = seq.Date(ymd("2022-01-01"), by = "day", length.out = 10))
My current favorite pattern involves across(), because I can use tidy selection verbs to pick the variables I want and then "map" a function over them.
# current favorite pattern
df <- df %>%
  mutate(across(starts_with("a"), as.character))
Does anyone have other favorite patterns or useful tricks here? It doesn't have to use mutate(). Often I have to change the column types of dataframes with hundreds of columns, so it becomes quite tedious.
Yes, this happens. A common pain point is dates stored as character: if you convert them once and then try to convert them again (say, inside a mutate() or summarise()), you get an error.
In such cases, change the data type only once you know what kind of data a column holds:
Select columns by name if the names carry meaning.
Check with the matching is.* function whether a column is already of the target type before applying as.* (see the sketch below).
Apply the conversion with map(), lapply(), or a for loop, whichever is comfortable.
It is hard to have a single approach for "all dataframes", though, since people name fields according to their own choice or convenience.
That's my approach; I hope others can add theirs.
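A minimal sketch of that is.* guard, reusing the df from the question above and assuming the columns starting with "c" are the date candidates:
# convert the c_ columns to Date only if they are not already Date
df <- df %>%
  mutate(across(
    starts_with("c"),
    ~ if (inherits(.x, "Date")) .x else as.Date(.x)
  ))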

How to mutate the values in an R list object?

I have a list named "binom" that looks as follows:
The "estimate_" values are probabilities that I want to reverse (i.e., calculate 1 - value). How can I mutate these values in this list?
I googled but did not find code for doing this. I need the result to remain a list afterwards, for plotting.
Try looking at ?base::transform or ?dplyr::mutate
You will first need to subset your list to the element you want to manipulate:
library(dplyr)
binom[[1]] %>%
  mutate(newcol = 1 - estimate_)
You can learn more about data transformation here
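If you need the same change applied to every element, here is a minimal sketch, assuming each element of binom is a data frame with an estimate_ column:
# apply the reversal to every list element, keeping the result a list
binom <- lapply(binom, function(d) dplyr::mutate(d, estimate_ = 1 - estimate_))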
In the future, it's helpful to provide a mock dataset with your question instead of a screenshot, so that people have something to work with when attempting to answer your questions.

dplyr equivalent of SQL's WHERE IN with another table

I am looking for dplyr's variant of the SQL WHERE IN clause.
My goal is to filter rows based upon their presence in a column of another table. The code I currently have returns an error; I am guessing this is due to incorrectly pulling the data out of the second table to compare with the first.
Instruments %>% filter(name %in% distinct(Musician$plays_instrument))
I wrote the example above to resemble what I currently have, so I am guessing my mistake can be seen in the syntax I am using. If not, I can provide a working example, but that takes some time to build and I was hoping to get this solved more quickly.
You should probably use unique(), since distinct() requires a dataframe as its first argument.
library(dplyr)
Instruments %>% filter(name %in% unique(Musician$plays_instrument))
We can use subset in base R
subset(Instruments, name %in% unique(Musician$plays_instrument))
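For completeness, dplyr's closest analogue to SQL's WHERE IN across two tables is semi_join(); a sketch, assuming Musician has a plays_instrument column matching Instruments$name:
library(dplyr)
# keep the rows of Instruments whose name appears in Musician$plays_instrument
Instruments %>%
  semi_join(Musician, by = c("name" = "plays_instrument"))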

Sample data after using filter or select from sparklyr

I have a large dataframe to analyse, so I'm using sparklyr to manage it in a fast way. My goal is to take a sample of the data, but before I need to select some variables of interest and filter some values of certain columns.
I tried to select and/or filter the data and then use the function sample_n but it always gives me this error:
Error in vapply(dots(...), escape_expr, character(1)) : values must
be length 1, but FUN(X[[2]]) result is length 8
Below is an example of the behaviour:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = 'local')
data_example <- copy_to(sc, iris, 'iris')
data_select <- select(data_example, Sepal_Length, Sepal_Width, Petal_Length)
data_sample <- sample_n(data_select, 25)
data_sample
I don't know if I'm doing something wrong, since I only started using this package a few days ago, but I could not find any solution to this problem. Any help will be appreciated!
It seems to be a problem with the type of object returned when you select/mutate/filter the data.
So I managed to work around the problem by materializing the data in Spark with the compute() command, and then sampling it.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = 'local')
data_example <- copy_to(sc, iris, 'iris')
data_select <- data_example %>%
  select(Sepal_Length, Sepal_Width, Petal_Length) %>%
  compute('data_select')
data_sample <- sample_n(data_select, 25)
data_sample
Unfortunately, this approach takes a long time to run and consumes a lot of memory, so I hope to find a better solution someday.
I also ran into the same issue earlier; then I tried the following:
data_sample <- data_select %>% head(25)
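Note that head() takes the first 25 rows rather than a random sample. A hedged alternative that samples on the Spark side is sdf_sample(), which draws a fraction of rows rather than a fixed count:
# sample ~10% of the rows inside Spark, without collecting the data first
data_sample <- data_select %>%
  sdf_sample(fraction = 0.1, replacement = FALSE, seed = 1)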

R - fix sorting when using anti_join to remove stop words (creating ngrams)

Very new to R and coding, and trying to do a frequency analysis on a long list of sentences and their given weightings. I've unnested and mutated the data, but when I try to remove stop words, the sort order of words within each sentence gets randomized. I need to create bigrams later on, and would prefer that they be based on the original phrase.
Here's the relevant code, can provide more if insufficient:
library(dplyr)
library(tidytext)
data <- data %>%
  anti_join(stop_words) %>%
  filter(!is.na(word))
What can I do to retain the original sort order within each sentence? I have all the words in a sentence indexed so I can match them to their given weight. Is there a better way to remove stop words that doesn't mess up the sort order?
Saw a similar question here but it's unresolved: How to stop anti_join from reversing sort order in R?
Also tried this but didn't work: dplyr How to sort groups within sorted groups?
Got help from a colleague in writing this but unfortunately they're not available anymore so any insight will be helpful. Thanks!
You could add a sort index to your data before the join, and restore the order afterwards:
library(dplyr)
library(tidytext)
data <- data %>%
  dplyr::mutate(idx = 1:n()) %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::filter(!is.na(word)) %>%
  dplyr::arrange(idx)
(The dplyr:: prefix is not necessary, but it helps you remember where each function comes from.)
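If the order matters within each sentence specifically, a variant of the same idea (assuming a sentence identifier column, hypothetically named sentence_id here):
data <- data %>%
  dplyr::group_by(sentence_id) %>%   # sentence_id is a hypothetical column name
  dplyr::mutate(idx = row_number()) %>%
  dplyr::ungroup() %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::filter(!is.na(word)) %>%
  dplyr::arrange(sentence_id, idx)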
