I would like to remove rows from a dataframe in R that contain a specific string (I can do this using grepl), but also the row directly below each pattern match. Removing the rows with the matching pattern seems simple enough using grepl:
df[!grepl("my_string",df$V1),]
The part I am stuck on is how to also remove the row below the row that contains the pattern matching "my_string" in the example above.
Thank you for any suggestions anyone has!
Using grep you can get the row numbers where the pattern is found. Increment those row numbers by 1 and remove both sets of rows.
inds <- grep("my_string",df$V1)
result <- df[-unique(c(inds, inds + 1)), ]
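For example, with some made-up data (the column name V1 follows the question):
df <- data.frame(id = 1:6,
                 V1 = c("keep", "my_string here", "dropped", "keep", "my_string", "dropped"))
inds <- grep("my_string", df$V1)
result <- df[-unique(c(inds, inds + 1)), ]
# rows 2 and 5 match, so rows 2, 3, 5 and 6 are dropped; rows 1 and 4 remain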
Using tidyverse -
library(dplyr)
library(stringr)
result <- df %>%
  filter({
    inds <- str_detect(V1, "my_string")
    # drop a row if it matches, or if the row directly above it matched
    !(inds | lag(inds, default = FALSE))
  })
How do I keep only rows that contain a certain string, given a list of strings? What I'm trying to say is I don't want to use grepl() and hardcode the values. Let's assume that I want to only keep records that contain abc or bbc or bcc or 20 more options in one of the columns, and I have x <- c("abc", "bbc", ....).
What can I do to only keep records containing values of x in the dataframe?
You can use %in%:
df_out <- df[df$v1 %in% x, ]
Or, you could form a regex alternation from the values in x and then use grepl with perl = TRUE (the non-capturing group (?: ) needs the Perl engine):
regex <- paste0("^(?:", paste(x, collapse="|"), ")$")
df_out <- df[grepl(regex, df$v1, perl = TRUE), ]
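A quick check on made-up data (the column name v1 and the values are just for illustration):
x  <- c("abc", "bbc", "bcc")
df <- data.frame(v1 = c("abc", "abcd", "bcc", "zzz"), n = 1:4)
df_out1 <- df[df$v1 %in% x, ]                        # rows 1 and 3
regex   <- paste0("^(?:", paste(x, collapse="|"), ")$")
df_out2 <- df[grepl(regex, df$v1, perl = TRUE), ]    # same rows; "abcd" is not kept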
The stringi package has good functions for extracting string pattern matches
newdat <- stringi::stri_extract_all(str, regex = pattern)
https://rdrr.io/cran/stringi/man/stri_extract.html
You can even pass the function a vector of patterns to match
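For example, a minimal sketch (the input strings are made up):
library(stringi)
s <- c("Audio quality is definitely good", "Battery life is poor")
stri_extract_all(s, regex = "quality|life")
# returns a list: "quality" for the first string, "life" for the second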
I am trying to get all sentences from a dataframe containing specific words into a new dataframe. I don't really know how to do this, but the first step I tried was to check if a word is in the column.
> "quality" %in% df$text[2]
[1] FALSE
> df$text[2]
[1] "Audio quality is definitely good"
Why is the output FALSE?
Also, do you have any suggestions on how to create my new dataframe? As an example, I'd like a dataframe with all rows containing any of c("word1","word2").
Thank you very much in advance.
%in% does an exact, whole-string match, so a partial match like this returns FALSE. If we need to match partially, use grepl
grepl("quality", df$text[2])
If we are doing this to check whether 'quality' occurs anywhere in the column, wrap it with any
any(grepl("quality", df$text))
For multiple elements, paste them together with collapse = "|"
v1 <- c("word1","word2")
any(grepl(paste(v1, collapse="|"), df$text))
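To build the new data frame the question asks for, the same pattern can be used to subset the rows (a sketch, assuming the data frame and column are called df and text as in the question):
v1 <- c("word1", "word2")
new_df <- df[grepl(paste(v1, collapse = "|"), df$text), ]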
According to ?"%in%",
%in% is currently defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
where match looks for exact (whole-string) matches, which is why the partial match above returns FALSE.
I am new to this community and currently working on an R project in which I need to find each comma-separated element from one dataframe in any of the columns of another dataframe. Here is an example:
#DataFrame1
a=c("AA,BB","BB,CC,FF","CC,DD,GG,FF","GG","")
df1=as.data.frame(a)
#DataFrame2
x=c("AA","XX","BB","YY","ZZ","MM","YY","CC")
y=c("DD""VV","NN","XX","CC","AA","WW","FF")
z=c("CC","AA","YY","GG","HH","OO","PP","QQ")
df2=data.frame(x,y,z)
What I need to do is find whether any of the elements is present in any of the columns (x, y, z) of df2. Take for example "AA,BB" (the first cell in column a of df1): "AA" is one element and "BB" is another. If a match is found, I need to identify that row or rows; there may also be more than one matching row in df2.
I hope I was able to explain this problem well. Please help.
Here is a solution in 2 steps:
# load tidyverse
library(tidyverse)
Step 1: Split the comma-separated elements of df1 into a new data frame new_df
1a) To do this, we first identify the number of columns to be generated
(the maximum number of elements separated by commas, i.e. the maximum number of commas + 1)
number_new_columns <- max(sapply(df1$a, function(x) str_count(x, ","))) + 1
1b) Generate the new data frame new_df
new_df <- df1 %>%
separate(a, c(as.character(seq_len(number_new_columns)))) # missing pieces will be filled with NA
# Above, we used c(as.character(seq_len(number_new_columns))) to generate column names as numbers -- not very creative :)
Step 2: Identify the position of each unique element from new_df in df2
(I hope I understood this second part of the question correctly)
2a) Get the unique elements (from new_df)
unique_elements <- unlist(new_df) %>%
unique()
2b) Get a list whose components contain the positions of each unique element within df2
output <- lapply(unique_elements, function(x) {
which(df2 == x, arr.ind=TRUE)
})
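Naming the list components makes it easier to look up a particular element afterwards (a small addition on top of the answer):
names(output) <- unique_elements
output[["AA"]]   # row/column positions of "AA" in df2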
I'd like to insert an underscore after the first three characters of all variable names in a data frame. Any help would be much appreciated.
Current data frame:
df1 <- data.frame("genCrc_b1"=c(1,1,1),"genprd"=c(1,1,1) ,"genopr_b1_b2"=c(1,1,1))
Desired data frame:
df2 <- data.frame("gen_Crc_b1"=c(1,1,1),"gen_prd"=c(1,1,1) ,"gen_opr_b1_b2"=c(1,1,1))
My attempts:
gsub('^(.{3})(.*)$', "_", names(df1))
gsub('^(.{3})(.*)$', '\\_\\2', names(df1))
We can use sub to capture the first 3 characters as a group ((.{3})) and, in the replacement, specify the backreference to that group (\\1) followed by an underscore
names(df1) <- sub("^(.{3})", "\\1_", names(df1))
names(df1)
#[1] "gen_Crc_b1" "gen_prd" "gen_opr_b1_b2"
In the OP's attempts, especially the last one, two capture groups were defined in the pattern, but the replacement only referenced the second one (\\_ is just an underscore, not a backreference to group 1)
gsub('^(.{3})(.*)$', '\\1_\\2', names(df1))
BTW, gsub is not needed, as we are replacing only a single instance rather than multiple ones.
In the first attempt, no backreferences to the captured groups were used in the replacement at all.
If your variable names all begin with gen, we can also do the following.
colnames(df1) <- gsub("gen", "gen_", colnames(df1), fixed = TRUE)
You can also use regmatches<- to replace the sub-expressions.
regmatches(names(df1), regexpr("gen", names(df1), fixed=TRUE)) <- "gen_"
Now, check that the values have been properly changed.
names(df1)
[1] "gen_Crc_b1" "gen_prd" "gen_opr_b1_b2"
Here, regexpr finds the first position in each element of the character vector that matches the subexpression, "gen". These positions are fed to regmatches and the substitution is performed.
I have a dataframe rawdata with columns that contain ecological information. I am trying to eliminate all of the rows for which the column LatinName matches a vector of species for which I already have some data, and create a new dataframe with only the species that are missing data. So, what I'd like to do is something like:
matches <- c("Thunnus thynnus", "Balaenoptera musculus", "Homarus americanus")
# obviously these are a random subset; the real vector has ~16,000 values
rawdata_missing <- rawdata %>% filter(LatinName != "matches")
This doesn't work because it compares LatinName to the literal string "matches" rather than to the vector. Alternatively I could do something like this:
rawdata_missing <- filter(rawdata, !grepl(matches, LatinName))
This doesn't work either, because grepl expects a single pattern rather than a vector of them.
I know there are a lot of ways I could subset rawdata using the rows where LatinName IS in matches, but I can't figure out a neat way to subset rawdata such that LatinName is NOT in matches.
Thanks in advance for the help!
filteredData <- rawdata[!(rawdata$LatinName %in% matches), ]
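Since the question already uses dplyr, the same condition also works inside filter (a sketch using the matches vector from the question):
library(dplyr)
rawdata_missing <- rawdata %>% filter(!(LatinName %in% matches))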
Another way is to use subset together with paste and grepl:
filteredData <- subset(rawdata, !grepl(paste(matches, collapse = "|"), LatinName))