Select text from multiple combinations of text within a dataframe in R

I want to subset data based on a text code that appears in numerous combinations throughout one column of a data frame. I first checked all the variations by creating a frequency table:
list <- as.data.frame(table(EQP$col1))
I want to search within the data frame for the text "EFC" (even when combined with other letters) and subset the matching rows into a resulting data frame.
I have looked through the question below, but it does not answer my question. I have also reviewed the tidytext package, but that does not seem to be the solution either.
How to Extract keywords from a Data Frame in R

You can simply use grepl.
Assuming your data frame is called df and the column to subset on is col1:
df <- data.frame(
  col1 = c("eraEFC", "dfs", "asdj, aslkj", "dlja,EFC,:LJ)"),
  stringsAsFactors = FALSE
)
df[grepl("EFC", df$col1), , drop = FALSE]

Another option besides the solution mentioned by Gallarus would be:
library(stringr)
library(dplyr)
df %>% filter(str_detect(col1, "EFC"))
As described by Sam Firke in this post:
Selecting rows where a column has a string like 'hsa..' (partial string match)

Related

How do I hash in these 2 dataframes in R?

So I have these two columns (genome$V9 and Impact4$INFO) from two different data frames.
Basically, there is a value inside each Impact4$INFO row (the structure is like OE6AXXXXXXX, where X is an integer) that I want to match against each row of genome$V9. I understand it is complicated, since there are a lot of values inside both columns...
Thank you
You can extract numbers from strings quite easily when the structure is consistent. Given that yours is, you can try:
library(stringr)
test <- c("ID=OE6A002689", "ID=OE6A044524", "ID=OE6A057168TI")
str_extract(test, "[0-9]{6}")
Output is:
[1] "002689" "044524" "057168"
Given you want to filter your genome data based on this, you can try:
library(dplyr)
library(stringr)
ids <- str_extract(Impact4$INFO, "[0-9]{6}")
genome %>%
  mutate(ind = str_extract(V9, "[0-9]{6}")) %>%
  filter(ind %in% ids)
Hope that helps! Otherwise you will have to provide a reproducible example (post example data here).

Filter row based on a string condition, dplyr filter, contains [duplicate]

This question already has answers here:
Selecting data frame rows based on partial string match in a column
(4 answers)
Closed 1 year ago.
I want to filter a dataframe using dplyr contains() and filter. Must be simple, right? The examples I've seen use base R grepl which sort of defeats the object. Here's a simple dataframe:
site_type <- c('Urban','Rural','Rural Background','Urban Background','Roadside','Kerbside')
row_id <- seq_along(site_type)  # row_id was not defined in the original example
df <- data.frame(row_id, site_type)
df <- tibble::as_tibble(df)  # as.tibble() is deprecated; use as_tibble()
df
Now I want to filter the dataframe to all rows where site_type contains the string "background".
I can find the string directly if I know the unique values of site_type:
filtered_df <- filter(df, site_type == 'Urban Background')
But I want to do something like:
filtered_df <- filter(df, site_type(contains('background', match_case = False)))
Any ideas how to do that? Can dplyr helper contains only be used with columns and not rows?
The contains function in dplyr is a select helper. Its purpose is to help when using the select function, and the select function is focused on selecting columns, not rows. See the documentation here.
filter is the intended mechanism for selecting rows. The function you are probably looking for is grepl which does pattern matching for text.
So the solution you are looking for is probably:
filtered_df <- filter(df, grepl("background", site_type, ignore.case = TRUE))
I suspect that contains is mostly a wrapper applying grepl to the column names. So the logic is very similar.
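To make the two scopes concrete, here is a minimal sketch (with a toy data frame, not the asker's data) contrasting contains() matching column names inside select() against grepl() matching cell values inside filter():

```r
library(dplyr)

df <- data.frame(site_code = 1:3,
                 site_type = c("Urban", "Rural", "Roadside"))

# contains() matches COLUMN NAMES inside select():
select(df, contains("site"))   # returns both columns, since both names contain "site"

# filter() + grepl() matches CELL VALUES inside a column:
filter(df, grepl("road", site_type, ignore.case = TRUE))
```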
References:
grep R documentation
high rated question applying exactly this technique

R: Subsetting one column from a dataframe, KEEPING the column name

I'm learning R, and I'm modifying a small piece of code. How do I make a subset of a dataframe, which is a single column, that includes the column name?
This does not work, as it doesn't retain the column name.
Data1Subset <- Data1$Level
The code sample I'm modifying follows this up with
colnames(Data1)
Also, is.data.frame(Data1) is TRUE
I finally found this with Google
Data1Subset <- subset(Data1, select = "Level")
Try the code below
Data1Subset <- Data1["Level"]
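The difference comes down to how the subset is returned: $ (and [[ ]]) drop down to a bare vector, while single brackets keep a one-column data frame with its name intact. A small sketch, using a made-up Data1:

```r
Data1 <- data.frame(Level = c(1, 2, 3), Score = c(10, 20, 30))

Data1$Level                      # a plain numeric vector; the name "Level" is gone
Data1["Level"]                   # a one-column data frame, column name kept
Data1[, "Level", drop = FALSE]   # the same result in row/column indexing form
```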

Loop Through Column Names with Similar Structure [duplicate]

This question already has answers here:
How to extract columns with same name but different identifiers in R
(3 answers)
Closed 3 years ago.
I have a very large dataset. A small subset of the columns share the same name stem with a numeric index (unlike the post "How to extract columns with same name but different identifiers in R", where the indexing value is a string). For example:
Q_1_1, Q_1_2, Q_1_3, ...
I am looking for a way to either loop through just those columns using the indices or to subset them all at once.
I have tried to use paste() to write their column names but have had no luck. See sample code below
# Define the data frame
df = data.frame("Q_1_1" = rep(1,5),"Q_1_2" = rep(2,5),"Q_1_3" = rep(3,5))
# Define the column name using paste()
cn <- as.symbol(paste("Q_1_",1, sep=""))
cn
df$cn
df$Q_1_1
I want df$cn to return the same thing as df$Q_1_1, but df$cn returns NULL.
If you are just trying to subset your data frame by column name, you could use dplyr to subset all your indexed columns at once, with a regex to match all column names following a certain pattern:
library(dplyr)
df = data.frame("Q_1_1" = rep(1,5),"Q_1_2" = rep(2,5),"Q_1_3" = rep(3,5), "A_1" = rep(4,5))
newdf <- df %>%
  dplyr::select(matches("Q_[0-9]_[0-9]"))
The [0-9] in the regex matches any digit between the underscores. Depending on which variables you're trying to match, you might have to change the regular expression.
The problem with your attempt is that df$cn looks for a column literally named "cn"; the $ operator does not evaluate its argument. To use a name stored in a variable, keep it as a character string and index with double brackets, e.g. df[[as.character(cn)]].
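A short sketch of the double-bracket approach, reusing the question's example data:

```r
df <- data.frame("Q_1_1" = rep(1, 5), "Q_1_2" = rep(2, 5), "Q_1_3" = rep(3, 5))

cn <- paste("Q_1_", 1, sep = "")  # the character string "Q_1_1"
df[[cn]]                          # same result as df$Q_1_1

# Looping over the indexed columns by constructed name:
for (i in 1:3) {
  print(sum(df[[paste0("Q_1_", i)]]))
}
```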
I hope this helps!

R update table column based on search string from another table

I am trying to update Cell B in a table based on the value of cell A in the same table. To filter the rows I want to update I am using grepl to compare cell A to a list of character strings from a list/table/vector or some other external source. For all rows where cell A matches the search criteria, I want to update cell B to say "xxxx". I need to do this for all rows in my table.
So far I have something like this where cat1 is a list of some sort that has strings to search for.
for (x in seq_along(cat1)) {
  data %<>% mutate(Cat = ifelse(grepl(cat1[x], ItemName), "xxx", Cat))
}
I am open to any better way of accomplishing this. I've tried for loops with dataframes and I'm open to a data.table solution.
Thank you.
To avoid the loop, you can collapse the character vector with | and then use it as a single pattern in grepl. For example, you can try:
cat1_collapsed <- paste(cat1, collapse = "|")
data %>% mutate(Cat = ifelse(grepl(cat1_collapsed, ItemName),"xxx", Cat))
Or the equivalent using data.table (or base R of course).
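For completeness, a base R sketch of the same collapsed-pattern idea, with made-up cat1 and data stand-ins for the asker's objects:

```r
cat1 <- c("apple", "pear")
data <- data.frame(ItemName = c("apple pie", "banana", "pear tart"),
                   Cat = c("old", "old", "old"),
                   stringsAsFactors = FALSE)

# No loop needed: one vectorised pattern match over the whole column
pattern <- paste(cat1, collapse = "|")
data$Cat <- ifelse(grepl(pattern, data$ItemName), "xxx", data$Cat)
```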
Use the following code, assuming you have a data frame called "data" with columns "A" and "B", and that "cat1" is a vector of the desired strings, as described:
library(data.table)
setDT(data)
data[A %in% cat1, B := "XXXX"]
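Note that %in% does exact matching only; for the partial, grepl-style matching the question describes, the same data.table update can take a collapsed pattern in the row filter. A sketch with hypothetical example data:

```r
library(data.table)

data <- data.table(A = c("apple pie", "banana", "pear tart"),
                   B = c("", "", ""))
cat1 <- c("apple", "pear")

# Partial match: rows where A contains any of the cat1 strings
data[grepl(paste(cat1, collapse = "|"), A), B := "XXXX"]
```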
