I have a dataframe rawdata with columns that contain ecological information. I am trying to eliminate all of the rows for which the column LatinName matches a vector of species for which I already have some data, and create a new dataframe with only the species that are missing data. So, what I'd like to do is something like:
matches <- c("Thunnus thynnus", "Balaenoptera musculus", "Homarus americanus")
# obviously these are a random subset; the real vector has ~16,000 values
rawdata_missing <- rawdata %>% filter(LatinName != "matches")
This doesn't work because it compares each LatinName against the literal string "matches", not against the vector. Alternatively I could do something like this:
rawdata_missing <- filter(rawdata, !grepl(matches, LatinName))
This doesn't work either, because grepl expects a single pattern rather than a vector of ~16,000 of them.
I know there are a lot of ways I could subset rawdata using the rows where LatinName IS in matches, but I can't figure out a neat way to subset rawdata such that LatinName is NOT in matches.
Thanks in advance for the help!
filteredData <- rawdata[!(rawdata$LatinName %in% matches), ]
Another way, using subset, paste, mapply and grepl:
filteredData <- subset(rawdata, mapply(grepl, rawdata$LatinName, paste(matches, collapse = "|")) == FALSE)
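For completeness, the same negated `%in%` drops straight into dplyr's `filter()`, which matches the pipe style of the original attempt. A minimal sketch with a toy data frame standing in for `rawdata` (the extra species and the Habitat column are invented for illustration):

```r
library(dplyr)

rawdata <- data.frame(
  LatinName = c("Thunnus thynnus", "Gadus morhua", "Homarus americanus"),
  Habitat   = c("pelagic", "demersal", "benthic")
)
matches <- c("Thunnus thynnus", "Balaenoptera musculus", "Homarus americanus")

# Keep only the species NOT already covered by `matches`
rawdata_missing <- rawdata %>% filter(!(LatinName %in% matches))
# only the "Gadus morhua" row remains
```

Because `%in%` does exact, vectorised membership testing, this stays fast even with ~16,000 values and avoids regex pitfalls entirely.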
How do I keep only rows that contain a certain string, given a list of strings? In other words, I don't want to use grepl() with hardcoded values. Let's assume that I want to only keep records that contain abc or bbc or bcc or 20 more options in one of the columns, and I have x <- c("abc", "bbc", ....).
What can I do to only keep records containing values of x in the dataframe?
You can use %in%:
df_out <- df[df$v1 %in% x, ]
Or, you could form a regex alternation with the values in x and then use grepl (the `(?:...)` non-capturing group is Perl-style, so set perl = TRUE):
regex <- paste0("^(?:", paste(x, collapse="|"), ")$")
df_out <- df[grepl(regex, df$v1, perl = TRUE), ]
The stringi package has good functions for extracting string pattern matches (note the pattern must be passed as a named argument, e.g. regex =):
newdat <- stringi::stri_extract_all(str, regex = pattern)
https://rdrr.io/cran/stringi/man/stri_extract.html
You can even pass the function a vector of strings as the pattern to match.
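A small self-contained check of the `%in%` approach from the first answer (the column names and values here are invented for illustration):

```r
x  <- c("abc", "bbc", "bcc")
df <- data.frame(id = 1:4, v1 = c("abc", "zzz", "bcc", "abd"))

# %in% does exact, vectorised membership testing against the whole vector
df_out <- df[df$v1 %in% x, ]
# keeps the rows with "abc" and "bcc"; "abd" is NOT kept, since %in%
# matches whole values, not substrings
```

Note the difference from a substring search: `grepl("abc|bbc|bcc", df$v1)` would behave the same here, but `%in%` can never match partially, which is usually what "keep records containing values of x" means.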
I would like to remove rows from a dataframe in R that contain a specific string (I can do this using grepl), but also the row directly below each pattern match. Removing the rows with the matching pattern seems simple enough using grepl:
df[!grepl("my_string",df$V1),]
The part I am stuck with is how to also remove the row below each row containing the pattern that matches "my_string" in the above example.
Thank you for any suggestions anyone has!
Using grep you can get the row numbers where the pattern occurs; increment them by 1 and remove both sets of rows:
inds <- grep("my_string",df$V1)
result <- df[-unique(c(inds, inds + 1)), ]
Using tidyverse -
library(dplyr)
library(stringr)
result <- df %>%
  filter({
    inds <- str_detect(V1, "my_string")
    !(inds | lag(inds, default = FALSE))
  })
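A quick run of the base-R version on a toy data frame (the contents are invented for illustration):

```r
df <- data.frame(V1 = c("keep1", "my_string here", "dropped too", "keep2"))

inds   <- grep("my_string", df$V1)                 # rows matching the pattern
result <- df[-unique(c(inds, inds + 1)), , drop = FALSE]
# rows 2 and 3 are removed; "keep1" and "keep2" remain
```

The `unique()` matters when matches occur on consecutive rows, so the same row index is not removed twice.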
I would like to subset a data frame (Data) by column names. I have a character vector with column name IDs I want to exclude (IDnames).
What I do normally is something like this:
Data[ ,!colnames(Data) %in% IDnames]
However, I am facing the problem that there is a name "X-360" and another one "X-360.1" in the columns. I only want to exclude the "X-360" (which is also in the character vector), but not "X-360.1" (which is not in the character vector, but extracted anyway). - So I want only exact matches, and it seems like this does not work with %in%.
It seems such a simple problem but I just cannot find a solution...
Update:
Indeed, the problem was that I had duplicated names in my data.frame! It took me a while to figure this out, because when I looked at the subsetted columns with
Data[ ,colnames(Data) %in% IDnames]
it showed "X-360" and "X-360.1" among the names, as stated above.
But this naming only appeared when subsetting; before that, the data frame simply had two columns with the same name ("X-360"), which happened because it was set up from matrices with cbind.
Here is a demonstration of what happened:
D1 <-matrix(rnorm(36),nrow=6)
colnames(D1) <- c("X-360", "X-400", "X-401", "X-300", "X-302", "X-500")
D2 <-matrix(rnorm(36),nrow=6)
colnames(D2) <- c("X-360", "X-406", "X-403", "X-300", "X-305", "X-501")
D <- cbind(D1, D2)
Data <- as.data.frame(D)
IDnames <- c("X-360", "X-302", "X-501")
Data[ ,colnames(Data) %in% IDnames]
X-360 X-302 X-360.1 X-501
1 -0.3658194 -1.7046575 2.1009329 0.8167357
2 -2.1987411 -1.3783129 1.5473554 -1.7639961
3 0.5548391 0.4022660 -1.2204003 -1.9454138
4 0.4010191 -2.1751914 0.8479660 0.2800923
5 -0.2790987 0.1859162 0.8349893 0.5285602
6 0.3189967 1.5910424 0.8438429 0.1142751
Learned another thing to be careful about when working with such data in the future...
One regex-based solution here would be to form an alternation of exact keyword matches (with perl = TRUE for the `(?:...)` group):
regex <- paste0("^(?:", paste(IDnames, collapse="|"), ")$")
Data[ , !grepl(regex, colnames(Data), perl = TRUE)]
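Since the real culprit turned out to be duplicated column names, a quick duplicate check before subsetting can save the detective work. A sketch using the same cbind setup as the demonstration above:

```r
D1 <- matrix(rnorm(4), nrow = 2); colnames(D1) <- c("X-360", "X-400")
D2 <- matrix(rnorm(4), nrow = 2); colnames(D2) <- c("X-360", "X-406")
D  <- cbind(D1, D2)

anyDuplicated(colnames(D))              # position of the first duplicate (0 if none)
colnames(D)[duplicated(colnames(D))]    # the offending names themselves
```

If duplicates are expected and should be kept distinct, `make.unique(colnames(D))` applies the same "X-360.1"-style suffixing up front, so later name-based subsetting behaves predictably.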
I have a dataframe with a column of codes separated by commas. I am currently filtering this dataframe by looking through the code column and if a code appears from the list, I keep that row. My issue is that this dataframe is expanding, as is the list of acceptable codes, so I'd like to speed this process up if possible. Ideally there would be a way to mark a row as already checked and if a good code was in the row, then to not have to check it again for all the other acceptable codes.
Current dataframe looks something like this:
Code_column
,12ab,
,12ab,123b,
,456t,345u,
,12ab,789p,
list of good codes:
good_codes <- c(',123b,', ',456t,', ',345u,')
My filtering process currently:
df %>%
filter(sapply(`Code_column`,
function(x) any(sapply(good_codes, str_detect, string = x))) == TRUE)
Final column
Code_column
,12ab,123b,
,456t,345u,
I don't think we need sapply here:
df[str_detect(df$Code_column,paste(good_codes, collapse = '|')),]
[1] ",12ab,123b," ",456t,345u,"
You can pass a single `|`-separated pattern to str_detect:
paste(good_codes, collapse = '|')
[1] ",123b,|,456t,|,345u,"
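Putting the collapsed pattern together with `str_detect` on the sample data from the question:

```r
library(stringr)

df <- data.frame(Code_column = c(",12ab,", ",12ab,123b,", ",456t,345u,", ",12ab,789p,"))
good_codes <- c(",123b,", ",456t,", ",345u,")

# One alternation pattern; each row is scanned once instead of once per code
pattern <- paste(good_codes, collapse = "|")
result  <- df[str_detect(df$Code_column, pattern), , drop = FALSE]
# keeps ",12ab,123b," and ",456t,345u,"
```

This works here because the codes contain no regex metacharacters; if codes could contain characters like `.` or `+`, they would need escaping (e.g. with `stringr::fixed()` per code, or a regex-quoting step) before collapsing.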
I would like to come up with an efficient way of finding a string in a data.frame including the values stored in row names.
Present approach
Primitively, I can achieve that running this code:
data(mtcars)
mtcars$rows <- row.names(mtcars)
sapply(mtcars, function(x) { grep("mazda",x, ignore.case = TRUE) })
I don't like it as it returns data for all columns:
> length(sapply(mtcars, function(x) { grep("mazda",x, ignore.case = TRUE) }))
[1] 12
I would like to prettify this code so it only returns:
- the column name for each successful match
- the row name for each successful match
in the format column X row
Additional considerations
Following suggestions expressed in comments, it occurred to me that I would also like to search column names, if possible.
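One way to get row and column positions together (a sketch, not taken from the thread) is to build a logical matrix with `grepl` and ask `which()` for array indices; a plain `grep` on `colnames()` covers the column-name search as well:

```r
data(mtcars)
mtcars$rows <- row.names(mtcars)

# sapply over the data frame yields a rows-by-columns logical matrix
hits <- sapply(mtcars, function(x) grepl("mazda", x, ignore.case = TRUE))
loc  <- which(hits, arr.ind = TRUE)

# Translate the indices into "column X row" labels
paste(colnames(mtcars)[loc[, "col"]], "X", row.names(mtcars)[loc[, "row"]])

# Column names can be searched the same way
grep("mazda", colnames(mtcars), ignore.case = TRUE, value = TRUE)
```

On mtcars this reports both Mazda rows as matches in the added `rows` column, one label per hit, instead of one grep result per column.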