Speed up string filtering in R

I have a dataframe with a column of codes separated by commas. I currently filter this dataframe by scanning the code column and keeping a row if any code from my list appears in it. My issue is that the dataframe keeps growing, as does the list of acceptable codes, so I'd like to speed this process up if possible. Ideally there would be a way to mark a row as already checked, so that once a good code is found in a row, it isn't checked again against all the other acceptable codes.
Current dataframe looks something like this:
Code_column
,12ab,
,12ab,123b,
,456t,345u,
,12ab,789p,
list of good codes:
good_codes <- c(',123b,', ',456t,', ',345u,')
My filtering process currently:
library(dplyr)
library(stringr)

df %>%
  filter(sapply(Code_column,
                function(x) any(sapply(good_codes, str_detect, string = x))))
Expected output:
Code_column
,12ab,123b,
,456t,345u,

You don't need sapply here: str_detect is vectorised over its string argument, so you can test the whole column in one pass
df[str_detect(df$Code_column,paste(good_codes, collapse = '|')),]
[1] ",12ab,123b," ",456t,345u,"
You can collapse the codes into a single alternation pattern with | and pass that to str_detect
paste(good_codes, collapse = '|')
[1] ",123b,|,456t,|,345u,"

Related

Issue with %in% in R

I am trying to get all sentences containing specific words from a dataframe into a new dataframe. I don't really know how to do this, but as a first step I tried to check whether a word is in the column.
> "quality" %in% df$text[2]
[1] FALSE
> df$text[2]
[1] "Audio quality is definitely good"
Why is the output FALSE?
Also, do you have any suggestions on how to create my new dataframe? For example, I'd like a dataframe with all rows whose text contains any of c("word1","word2").
Thank you very much in advance.
%in% does an exact, whole-string match, so it returns FALSE unless an element equals "quality" exactly. For a partial (substring) match, use grepl
grepl("quality", df$text[2])
If we are doing this to check whether 'quality' appears anywhere in the column, wrap it with any
any(grepl("quality", df$text))
For multiple elements, paste them together with collapse = "|"
v1 <- c("word1","word2")
any(grepl(paste(v1, collapse="|"), df$text))
According to the help page (?"%in%"),
%in% is currently defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
where match compares strings for exact equality.
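As for building the new dataframe: a minimal sketch (variable names here are illustrative, not from the original post) that keeps every row whose text contains at least one of the target words:

words <- c("word1", "word2")

# Logical index: TRUE for rows mentioning any target word
keep   <- grepl(paste(words, collapse = "|"), df$text, ignore.case = TRUE)
new_df <- df[keep, , drop = FALSE]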

Finding All occurrences of a string in a vector (grep(), which() functions don't quite work)

I have this spreadsheet and want to see how often the string "Drake" appears.
For example, some rows say "Drake, Kendrick Lamar, Post Malone", but they're not being counted with the code I have:
data <- read.csv("C:/Users/Gabriel/Documents/responses.csv", header = TRUE)
artist <- data$artists
grep("Drake$", artist)
artistcount <- which('Drake' == artist)
artistcount
the results I get from grep() or which() are both
# 7 47 71
I want ALL rows where "Drake" appears. This code shows me which rows had "Drake" as the ONLY string. It should be way more than just 3 rows.
I appreciate any feedback.
This can be done using the filter function from dplyr and str_detect from stringr.
library(stringr)
library(dplyr)
data <- read.csv(choose.files())
drake <- data %>%
  filter(str_detect(artists, "Drake"))
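If the goal is a count rather than a subset, dropping the $ anchor and using grepl flags every row in which "Drake" appears anywhere in the string (a sketch, assuming the same data object as above):

# TRUE wherever "Drake" occurs anywhere in the string, not just as the whole value
hits <- grepl("Drake", data$artists)

which(hits)  # all matching row numbers
sum(hits)    # number of rows mentioning Drake (one count per row)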

Efficient way of searching for a string in a data frame

I would like to come up with an efficient way of finding a string in a data.frame including the values stored in row names.
Present approach
Primitively, I can achieve that by running this code:
data(mtcars)
mtcars$rows <- row.names(mtcars)
sapply(mtcars, function(x) { grep("mazda",x, ignore.case = TRUE) })
I don't like it as it returns data for all columns:
> length(sapply(mtcars, function(x) { grep("mazda",x, ignore.case = TRUE) }))
[1] 12
I would like to prettify this code so it only returns:
column name for successful match
row name for successful match
in format column X row
Additional considerations
Following suggestions expressed in comments, it occurred to me that I would also like to search column names, if possible.
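One possible approach, sketched under the assumption that coercing everything to a character matrix is acceptable: search the matrix (row names included via the rows column), restore the dimensions that grepl() drops, and index with arr.ind = TRUE to recover column and row labels.

data(mtcars)
mtcars$rows <- row.names(mtcars)

m   <- sapply(mtcars, as.character)           # character matrix of all columns
hit <- grepl("mazda", m, ignore.case = TRUE)  # grepl() returns a plain vector
dim(hit) <- dim(m)                            # restore the matrix shape

idx <- which(hit, arr.ind = TRUE)
data.frame(column = colnames(m)[idx[, "col"]],
           row    = row.names(mtcars)[idx[, "row"]])
#   column           row
# 1   rows     Mazda RX4
# 2   rows Mazda RX4 Wag

# Column names can be searched separately (returns character(0) here):
grep("mazda", names(mtcars), ignore.case = TRUE, value = TRUE)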

Split a character column from a dataframe based on specific token

I have a dataframe df and the first column looks like this:
[1] "760–563" "01455–1" "4672–04" "11–31234" "22–12" "11111–53" "111–21" "17–356239" "14–22352" "531–353"
I want to split that column on -.
What I'm doing is
strsplit(df[,1], "-")
The problem is that it's not working. It returns a list, but the elements aren't split. I already tried adding the parameter fixed = TRUE and putting a regular expression in the split parameter, but nothing worked.
What is weird is that if I replicate the column on my own, for example:
myVector <- c("760–563", "01455–1", "4672–04", "11–31234", "22–12", "11111–53", "111–21", "17–356239", "14–22352", "531–353")
and then apply the strsplit, it works.
I already checked my column type and class with
class(df[,1]) and typeof(df[,1]); both return "character", so that looks right.
I was also using the dataframe with dplyr, so it was of type tbl_df; converting it back to a plain dataframe made no difference.
I also tried apply(df, 2, function(x) strsplit(x, "-", fixed = TRUE)), but that didn't work either.
Any clues?
I don't know how you did it, but you have two different types of dashes:
charToRaw(substr("760–563", 4, 4))
#[1] 96
charToRaw("-")
#[1] 2d
So strsplit() is working just fine; the plain hyphen simply isn't the character in your original data. Split on the en dash instead, and away you go:
strsplit("760–563", "–")
#[[1]]
#[1] "760" "563"
Alternatively, you can sidestep the dash question entirely and split on any non-numeric character
library(dplyr)
library(tidyr)
data %>%
  separate(your_column,
           c("first_number", "second_number"),
           sep = "[^0-9]")

Eliminate dataframe rows that match a character string

I have a dataframe rawdata with columns that contain ecological information. I am trying to eliminate all of the rows for which the column LatinName matches a vector of species for which I already have some data, and create a new dataframe with only the species that are missing data. So, what I'd like to do is something like:
matches <- c("Thunnus thynnus", "Balaenoptera musculus", "Homarus americanus")
# obviously these are a random subset; the real vector has ~16,000 values
rawdata_missing <- rawdata %>% filter(LatinName != "matches")
This doesn't work because != compares LatinName against the literal string "matches", not against the values in the vector. Alternatively I could do something like this:
rawdata_missing <- filter(rawdata, !grepl(matches, LatinName))
This doesn't work either, because grepl takes a single pattern and only uses the first element of a pattern vector.
I know there are a lot of ways I could subset rawdata using the rows where LatinName IS in matches, but I can't figure out a neat way to subset rawdata such that LatinName is NOT in matches.
Thanks in advance for the help!
filteredData <- rawdata[!(rawdata$LatinName %in% matches), ]
Another way, using subset, paste, mapply and grepl:
filteredData <- subset(rawdata, mapply(grepl, rawdata$LatinName, paste(matches, collapse = "|")) == FALSE)
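Since the question already uses dplyr, a tidyverse-flavoured version of the same negated filter may be the neatest option (a sketch using the vector names from the question):

library(dplyr)

# Keep only species NOT already covered by the matches vector
rawdata_missing <- rawdata %>%
  filter(!LatinName %in% matches)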
