I am trying to get all sentences containing specific words from a dataframe into a new dataframe. I don't really know how to do this, but as a first step I tried to check whether a word is in the column.
> "quality" %in% df$text[2]
[1] FALSE
> df$text[2]
[1] "Audio quality is definitely good"
Why is the output false?
Also, do you have any suggestions on how to create my new dataframe? As an example, I'd like a dataframe with all rows containing any of c("word1","word2").
Thank you very much in advance.
It is because %in% tests for an exact match of the whole string, not a partial match. To check whether a string contains a word, use grepl
grepl("quality", df$text[2])
If we are doing this to check if there are any 'quality' in the column, wrap with any
any(grepl("quality", df$text))
For multiple elements, paste them together with collapse = "|"
v1 <- c("word1","word2")
any(grepl(paste(v1, collapse="|"), df$text))
According to ?"%in%",
%in% is currently defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
where match compares strings by exact equality, which is why a partial match like "quality" returns FALSE.
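To see the difference on a small example (the text vector here is made up to mimic the question's column):

```r
# %in% tests whole-string equality; grepl tests for a substring
text <- c("Audio quality is definitely good", "Battery life is poor")

"quality" %in% text     # FALSE: no element is exactly the string "quality"
grepl("quality", text)  # TRUE FALSE: substring match, element by element

# keep only the rows whose text mentions any of several words
df <- data.frame(text = text, stringsAsFactors = FALSE)
v1 <- c("quality", "battery")
df[grepl(paste(v1, collapse = "|"), df$text, ignore.case = TRUE), , drop = FALSE]
```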
I would like to remove rows from a dataframe in R that contain a specific string (I can do this using grepl), but also the row directly below each pattern match. Removing the rows with the matching pattern seems simple enough using grepl:
df[!grepl("my_string",df$V1),]
The part I am stuck on is how to also remove the row below each row that contains the pattern matching "my_string" in the above example.
Thank you for any suggestions anyone has!
Using grep you can get the row number where you find a pattern. Increment the row number by 1 and remove both the rows.
inds <- grep("my_string",df$V1)
result <- df[-unique(c(inds, inds + 1)), ]
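A quick check with made-up data (the strings are placeholders):

```r
df <- data.frame(
  V1 = c("keep", "my_string here", "row below match", "keep",
         "my_string again", "also removed"),
  stringsAsFactors = FALSE
)

inds <- grep("my_string", df$V1)                    # rows 2 and 5
result <- df[-unique(c(inds, inds + 1)), , drop = FALSE]
result$V1
# [1] "keep" "keep"
```

Negative indices that fall past the last row (when a match sits on the final row) are silently ignored, so `inds + 1` is safe there.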
Using tidyverse -
library(dplyr)
library(stringr)
result <- df %>%
  filter({
    inds <- str_detect(V1, "my_string")
    !(inds | lag(inds, default = FALSE))
  })
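The same idea run on toy data with dplyr (note that str_detect takes the string first, then the pattern):

```r
library(dplyr)
library(stringr)

df <- data.frame(
  V1 = c("keep", "my_string here", "row below match", "keep"),
  stringsAsFactors = FALSE
)

result <- df %>%
  filter({
    inds <- str_detect(V1, "my_string")
    !(inds | lag(inds, default = FALSE))
  })
result$V1
# [1] "keep" "keep"
```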
I have a dataframe. I split this dataframe into subframes of 6 rows each, stored in a list.
If the word "#ERROR" appears anywhere in a subframe (even in just one row), I want that whole dataframe deleted, so that I receive a list with a smaller number of dataframes. Then I am going to convert the list into a dataframe again. My problem is that I have tried different codes and I cannot figure out how to eliminate the subdataframes containing the specific word from the list.
I tried the following
a<-dataset
View(a)
my.list<-split(a, rep(1:119, each = 6))
z=lapply(1:length(my.list), function(i) my.list[[i]] != "#ERROR")
but what I get is 119 elements of TRUE/FALSE values. What I want is to eliminate the subframes that produced FALSE. Can anyone please help?
Try using sapply, as it returns a vector instead of a list like lapply does, so it can be used directly to subset the list.
new.list <- my.list[sapply(1:length(my.list), function(i)
all(my.list[[i]] != "#ERROR"))]
Or, a bit simplified, with Filter:
new.list <- Filter(function(x) all(x != "#ERROR"), my.list)
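A toy run, assuming each list element is a small dataframe:

```r
my.list <- list(
  data.frame(x = c("ok", "ok")),
  data.frame(x = c("ok", "#ERROR")),   # this whole subframe should go
  data.frame(x = c("fine", "fine"))
)

new.list <- Filter(function(x) all(x != "#ERROR"), my.list)
length(new.list)
# [1] 2

# re-assemble the surviving subframes into one dataframe
do.call(rbind, new.list)
```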
I have a column that has numeric and string codes. I'd like to find only those rows that contain a particular string and not the others. In this case, I only need rows that contain SE codes and nothing else.
df :
names
SE123, FE43, SA67
SE167, SE24, SE56, SE34
SE23
FE36, KE90, LS87
DG20, SE34, LP47
SE57, SE39
Result df
names
SE167, SE24, SE56, SE34
SE23
SE57, SE39
My code
df[grep("^SE", as.character(df$names)),]
But this selects every row that has SE. Would somebody please help in achieving the result df? Thanks.
Looking at your expected output, it seems you want to select the rows where every element starts with "SE", an element being a word between two commas.
Using base R, one method would be to split the strings on "," and select rows where every element startsWith "SE"
df[sapply(strsplit(df$names, ","), function(x)
all(startsWith(trimws(x), "SE"))), , drop = FALSE]
# names
#2 SE167, SE24, SE56, SE34
#3 SE23
#6 SE57, SE39
If you want to find presence of "SE" irrespective of position maybe grepl is a better choice.
df[sapply(strsplit(df$names, ","), function(x)
all(grepl("SE", trimws(x)))), , drop = FALSE]
Make sure names is a character column before doing strsplit, or run
df$names <- as.character(df$names)
names[!grepl("[A-Z]",gsub("SE","",names))]
[1] "SE167, SE24, SE56, SE34" "SE23" "SE57, SE39"
You can remove the SE from all strings and then look for any remaining uppercase letter. Strings that held only SE codes will not contain any other uppercase letters and are thus kept by the filter.
(This also works if you have codes like 25SE.)
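Spelled out on the sample column (stored here as a plain character vector called names):

```r
names <- c("SE123, FE43, SA67", "SE167, SE24, SE56, SE34", "SE23")

# strip every "SE"; rows that held only SE-codes have no capital letters left
gsub("SE", "", names)
# [1] "123, FE43, SA67" "167, 24, 56, 34" "23"

names[!grepl("[A-Z]", gsub("SE", "", names))]
# [1] "SE167, SE24, SE56, SE34" "SE23"
```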
I have a dataframe with a column of codes separated by commas. I am currently filtering this dataframe by looking through the code column and if a code appears from the list, I keep that row. My issue is that this dataframe is expanding, as is the list of acceptable codes, so I'd like to speed this process up if possible. Ideally there would be a way to mark a row as already checked and if a good code was in the row, then to not have to check it again for all the other acceptable codes.
Current dataframe looks something like this:
Code_column
,12ab,
,12ab,123b,
,456t,345u,
,12ab,789p,
list of good codes:
good_codes <- c(',123b,', ',456t,', ',345u,')
My filtering process currently:
df %>%
filter(sapply(`Code_column`,
function(x) any(sapply(good_codes, str_detect, string = x))) == TRUE)
Final column
Code_column
,12ab,123b,
,456t,345u,
I think we do not need sapply here; a single combined pattern will do.
df[str_detect(df$Code_column,paste(good_codes, collapse = '|')),]
[1] ",12ab,123b," ",456t,345u,"
You can pass the alternatives collapsed with | to str_detect as one pattern:
paste(good_codes, collapse = '|')
[1] ",123b,|,456t,|,345u,"
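The same idea works in base R without stringr, if you prefer to avoid the dependency (a sketch using the question's sample values):

```r
good_codes <- c(",123b,", ",456t,", ",345u,")
Code_column <- c(",12ab,", ",12ab,123b,", ",456t,345u,", ",12ab,789p,")

pat <- paste(good_codes, collapse = "|")
Code_column[grepl(pat, Code_column)]
# [1] ",12ab,123b," ",456t,345u,"
```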
In genomics research, you often have many strings with duplicate gene names. I would like to find an efficient way to keep only the unique gene names in a string. Here is an example that works. But isn't it possible to do this in one step, i.e., without having to split the entire string and then paste the unique elements back together?
genes <- c("GSTP1;GSTP1;APC")
a <- unlist(strsplit(genes, ";"))
paste(unique(a), collapse=";")
[1] "GSTP1;APC"
An alternative is doing
unique(unlist(strsplit(genes, ";")))
#[1] "GSTP1" "APC"
Then this should give you the answer
paste(unique(unlist(strsplit(genes, ";"))), collapse = ";")
#[1] "GSTP1;APC"
Based on the example shown, perhaps
gsub("(\\w+);\\1", "\\1", genes)
#[1] "GSTP1;APC"
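One caveat: the backreference collapses only one adjacent pair per match, so "GSTP1;GSTP1;GSTP1" would still leave a duplicate. Quantifying the repeated group handles longer runs (a sketch, assuming duplicates are always adjacent, as in the example):

```r
genes3 <- "GSTP1;GSTP1;GSTP1;APC;APC"
gsub("(\\w+)(?:;\\1)+", "\\1", genes3, perl = TRUE)
# [1] "GSTP1;APC"
```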