R - fix sorting when using anti_join to remove stop words (creating ngrams)

Very new to R and coding. I'm trying to do a frequency analysis on a long list of sentences and their given weightings. I've un-nested and mutated the data, but when I try to remove stop words, the sort order of words within each sentence gets randomized. I need to create bigrams later on, and would prefer that they're based on the original phrase.
Here's the relevant code; I can provide more if it's insufficient:
library(dplyr)
library(tidytext)

data = data %>%
  anti_join(stop_words) %>%
  filter(!is.na(word))
What can I do to retain the original sort order within each sentence? I have all the words in a sentence indexed so I can match them to their given weight. Is there a better way to remove stop words that doesn't mess up the sort order?
Saw a similar question here but it's unresolved: How to stop anti_join from reversing sort order in R?
Also tried this but didn't work: dplyr How to sort groups within sorted groups?
Got help from a colleague in writing this but unfortunately they're not available anymore so any insight will be helpful. Thanks!

You could add a row index to your data before the anti_join and re-sort by it afterwards:
library(dplyr)
library(tidytext)
data = data %>%
  dplyr::mutate(idx = 1:n()) %>%
  dplyr::anti_join(stop_words) %>%
  dplyr::filter(!is.na(word)) %>%
  dplyr::arrange(idx)
(The dplyr:: prefix is not necessary, but it helps you remember where each function comes from.)
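As a minimal end-to-end sketch of the same idea (the sentences, weights, and column names below are invented for illustration, not taken from your data):

library(dplyr)
library(tidytext)

# invented example data
sentences <- tibble(
  sentence_id = 1:2,
  text = c("the quick brown fox", "a very lazy dog sleeps"),
  weight = c(0.8, 0.3)
)

tokens <- sentences %>%
  unnest_tokens(word, text) %>%
  group_by(sentence_id) %>%
  mutate(idx = row_number()) %>%          # position of each word within its sentence
  ungroup() %>%
  anti_join(stop_words, by = "word") %>%  # stop words dropped, weight columns kept
  arrange(sentence_id, idx)               # restore the original word order

Because idx is assigned before the join, the surviving words can be re-sorted into their original order and bigrams built from consecutive idx values within each sentence.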

Related

Most efficient algorithm to filter dataframe based on two nested conditions in R?

I'm currently working with a really large dataframe (~2M rows) of "landings" and "takeoffs", with information such as the time the operation happened, the airport, where the flight was heading, and so on.
What I want to do is filter the whole DF into a new one that only considers "flights" (so about half the entries), matching each takeoff with its corresponding landing based on the airport codes of the origin and destination airports.
What I did works, but considering how large the DF is, it takes about 200 hours to complete:
Loop over all rows of DF looking for df$Operation == "takeoff" {
  Loop over all rows below the row found above, looking for df$Operation == "landing"
  where the origin and destination airport codes match the "takeoff" entry {
    Once found, add the data I need to the new df called Flights
  }
}
(If the second loop does not find a match in the next 100 rows, it discards the entry and searches for the next "takeoff".)
Is there a function that performs this operation in a more efficient way? If not, do you know of an algorithm that could be much faster than the one I used?
I am really not used to data science, nor R. Any help will be appreciated.
Thanks in advance!
In R we try to avoid explicit loops. For filtering a dataframe I would use the filter function from dplyr; dplyr is easy to use and fast for working with dataframes. If it's still not fast enough, you can try data.table, but it's a bit less user-friendly.
I think this does what you want:
library(dplyr)

flights <- df %>%
  arrange(datetime) %>%                 # make sure the data is in the right order
  group_by(origin, destination) %>%     # for each flight path
  dplyr::filter(Operation %in% c("takeoff", "landing"))  # get these rows
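If you also need to pair each takeoff with the landing that follows it on the same route, a vectorised sketch using lead() could look like the following (the column names datetime, origin, destination, and Operation are assumptions based on your description, and it assumes a takeoff's landing is the next recorded row for that route):

library(dplyr)

paired_flights <- df %>%
  arrange(datetime) %>%
  group_by(origin, destination) %>%           # same airport codes = same route
  mutate(next_operation = lead(Operation),    # look at the next row on this route
         landing_time = lead(datetime)) %>%
  ungroup() %>%
  filter(Operation == "takeoff", next_operation == "landing")

Because everything is vectorised, this avoids the nested loop entirely and should finish in seconds rather than hours on ~2M rows.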
I recommend the online book R For Data Science:
https://r4ds.had.co.nz/

group_by() and unique() in R both return a duplicate

I have a list of samples. Each sample has a country of origin, recorded in a column called country.
I run the following code:
country_counts <- metadata %>%
  group_by(country) %>%
  count()
For 192 countries, this works. Romania, however, is duplicated. I have done everything I know to fix this: I have removed the white space and I have used str_replace, but I am still left with the duplicate. When I use str_replace to replace the value with "apple", I am left with two "apple" values. I am not sure what is left to do. I have also tried copying the column and copying the entire database. Still nothing works for me. Any advice?
My guess is that there is some issue with a character in the string: it looks the same but is a different Unicode character.
My quick and dirty solution would be to use str_detect to replace all Romania-like strings.
library(stringr)
metadata$country[str_detect(metadata$country, "Ro")] <- "Romania"
You have to adjust the pattern in str_detect to make it work in your specific case.
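If you want to confirm that a hidden or lookalike character really is the culprit before overwriting anything, one quick base-R check (the column name is taken from the question) is to print the Unicode code points of every distinct Romania-like value:

library(stringr)

# each character is printed as its integer code point; an invisible or
# lookalike character shows up as an extra or unexpected number
romania_like <- unique(metadata$country[str_detect(metadata$country, "Ro")])
lapply(romania_like, utf8ToInt)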

How to mutate the values in an R list object?

I have a list named "binom" that looks as follows:
"estimate_" values are probabilities that I want to reverse (to do a calculation "1-value"). How to mutate these values in this list?
I googled but did not find a code for doing this. I need the list afterwards as a list for plotting.
Try looking at ?base::transform or ?dplyr::mutate
You will first need to subset your list to the element you want to manipulate:
library(dplyr)
binom[[1]] %>%
  mutate(newcol = 1 - estimate_)
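If every element of the list needs the same transformation, one sketch is to wrap the same call in lapply so the result stays a list for plotting (the column name estimate_ is read off your screenshot):

# apply the transformation to every data frame in the list
binom <- lapply(binom, function(x) mutate(x, estimate_ = 1 - estimate_))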
You can learn more about data transformation here
In the future, it's helpful to provide a mock dataset with your question instead of a screenshot, so that people have something to work with when attempting to answer your questions.

dplyr where in other table

I am looking for dplyr's equivalent of the SQL WHERE ... IN clause.
My goal is to filter rows based upon their presence in a column of another table. The code I currently have returns an error. I am guessing this is due to incorrectly pulling out the data from the second table to compare to the first.
Instruments %>% filter(name %in% distinct(Musician$plays_instrument))
I wrote an example similar to what I've currently got above this line. I am guessing that my mistake can be seen in the syntax I am using. If not, I can provide a working example; it just takes some time to build, and I was hoping I could get this solved more quickly.
You should probably use unique, since distinct requires a dataframe as its first argument.
library(dplyr)
Instruments %>% filter(name %in% unique(Musician$plays_instrument))
We can use subset in base R
subset(Instruments, name %in% unique(Musician$plays_instrument))
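For a small self-contained illustration (the example data below is invented; the column names follow the question), note that dplyr's semi_join is another idiomatic way to express a SQL-style WHERE IN:

library(dplyr)

# invented example data
Instruments <- tibble(name = c("guitar", "violin", "drums"), price = c(300, 900, 500))
Musician <- tibble(musician = c("Ana", "Bo"), plays_instrument = c("violin", "drums"))

Instruments %>% filter(name %in% unique(Musician$plays_instrument))

# semi_join keeps the rows of Instruments that have a match in Musician
Instruments %>% semi_join(Musician, by = c("name" = "plays_instrument"))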

Trouble when grouping column data in R

I'm trying to extract a vector from a column that contains all the values that belong to a specific type.
To be more concise, I have a table with 5000 rows. The column "Types" can take the values A, B, C, or D. I need to extract all the rows that belong to a specific type, like A.
I want to use the function group_by from the dplyr library, and I'm trying:
dplyr::group_by(my_database, type)
or even
dplyr::group_by(my_database, type, A)
but I don't get what I need. Does anyone know how to proceed?
NOTE: I can't use the %>% function because, for some strange reason, R says "could not find %>% function".
Thanks in advance for your kind responses.
R can't find %>% because you have not loaded a library that provides it. For example:
library(dplyr)
The function that you're looking for is filter.
my_database %>% filter(Types == "A")
Spend some time with the dplyr documentation (introduction) or a good tutorial.
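For a quick illustration of the difference (the toy data below is made up), filter subsets rows, while group_by only marks groups for a later summary:

library(dplyr)

# toy data
my_database <- tibble(Types = c("A", "B", "A", "C"), value = 1:4)

# all rows of one type
my_database %>% filter(Types == "A")

# group_by on its own does not drop rows; it is usually followed by summarise()
my_database %>% group_by(Types) %>% summarise(n = n())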
