R update table column based on search string from another table

I am trying to update cell B in a table based on the value of cell A in the same row. To filter the rows I want to update, I am using grepl to compare cell A to a list of character strings from a list/table/vector or some other external source. For all rows where cell A matches the search criteria, I want to update cell B to say "xxxx". I need to do this for all rows in my table.
So far I have something like this, where cat1 is a list of some sort that holds the strings to search for.
library(dplyr)
library(magrittr)  # for the %<>% assignment pipe

for (x in seq_along(cat1)) {
  data %<>% mutate(Cat = ifelse(grepl(cat1[x], ItemName), "xxx", Cat))
}
I am open to any better way of accomplishing this. I've tried for loops with dataframes and I'm open to a data.table solution.
Thank you.

To avoid the loop you can collapse the character vector with | and then use it as a single pattern in grepl. For example:
cat1_collapsed <- paste(cat1, collapse = "|")
data %>% mutate(Cat = ifelse(grepl(cat1_collapsed, ItemName), "xxx", Cat))
Or the equivalent using data.table (or base R of course).
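For instance, a minimal data.table sketch of the same idea, assuming the same data with ItemName and Cat columns as above:
library(data.table)
setDT(data)
data[grepl(cat1_collapsed, ItemName), Cat := "xxx"]
Note that each element of cat1 is treated as a regular expression here, so escape any regex metacharacters before collapsing.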

Use the following code, assuming that you have a data frame called "data" with columns "A" and "B", and that "cat1" is a vector of the desired strings, as described:
library(data.table)
setDT(data)
data[A %in% cat1, B := "XXXX"]
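Note that %in% matches whole values exactly. If, as in the original question, cat1 holds substrings to search for, the same data.table update can use grepl instead (a sketch, assuming cat1 contains no regex metacharacters):
data[grepl(paste(cat1, collapse = "|"), A), B := "XXXX"]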

Related

How to make R ignore the "undefined columns selected" error in cbind()?

I have a dataframe A and want to cbind a column that exists in A with ones that do not exist in A. I want cbind to ignore the columns that do not exist and bind only the existing ones, something similar to cbind(A$Key.Name, A$Dummy1, A$Dummy2), while preserving the data frame class and the column names.
library(jsonlite)

A <- fromJSON('[{"Key":{"Name":"Victor","ID":61426},"Type":"Unknown","Domain":"Cooking"}]',
              flatten = TRUE)
names(A)
cbind(A["Key.Name"], A["Dummy1"], A["Dummy2"])  # error: undefined columns selected
Use intersect to select only those columns that are present in the data.
cols_to_select <- c('Key.Name', 'Dummy1', 'Dummy2')
result <- A[intersect(names(A), cols_to_select)]
In dplyr you can use any_of():
library(dplyr)
A %>% select(any_of(cols_to_select))
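Applied to the sample A above, both approaches silently drop the missing columns. A sketch of the expected result:
A %>% select(any_of(cols_to_select))
#>   Key.Name
#> 1   Victor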

Filter rows based on a string condition, dplyr filter, contains [duplicate]

This question is a duplicate of: Selecting data frame rows based on partial string match in a column
I want to filter a dataframe using dplyr's contains() and filter(). Must be simple, right? The examples I've seen use base R's grepl, which sort of defeats the object. Here's a simple dataframe:
library(dplyr)

row_id <- 1:6  # not defined in the original question; assumed here so the example runs
site_type <- c('Urban','Rural','Rural Background','Urban Background','Roadside','Kerbside')
df <- as_tibble(data.frame(row_id, site_type))
df
Now I want to filter the dataframe to all rows where site_type contains the string "background".
I can find the string directly if I know the unique values of site_type:
filtered_df <- filter(df, site_type == 'Urban Background')
But I want to do something like:
filtered_df <- filter(df, site_type(contains('background', match_case = False)))
Any ideas how to do that? Can the dplyr helper contains() only be used with columns and not rows?
The contains function in dplyr is a select helper. Its purpose is to help when using the select function, and select is focused on selecting columns, not rows. See the documentation here.
filter is the intended mechanism for selecting rows. The function you are probably looking for is grepl which does pattern matching for text.
So the solution you are looking for is probably:
filtered_df <- filter(df, grepl("background", site_type, ignore.case = TRUE))
I suspect that contains is mostly a wrapper applying grepl to the column names. So the logic is very similar.
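If you prefer to stay in the tidyverse, a stringr-based sketch of the same filter, using regex() for case-insensitive matching:
library(stringr)
filtered_df <- filter(df, str_detect(site_type, regex("background", ignore_case = TRUE)))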
References:
grep R documentation
a highly rated question applying exactly this technique

Find whether a row in a data table contains at least one word from a list

I am quite new to R and data tables, so my question may sound obvious, but I searched through similar questions here and couldn't find a solution.
So, initially, I have a data table, and one of the columns contains fields that hold many values (in fact, separate words) joined together by &&&&. I also have a list of words (list). The real list is big, with 38,000 different words, but for the purpose of the example let's say it is small.
list <- c('word1', 'word2', 'word3')
What I need is to filter the data table so that I only have rows that contain at least one word from the list of words.
I unjoined the data by &&&& and created a list:
fields_with_words <- strsplit(data_final$fields_with_words, "&&&&")
But I don't know which function I should use to check whether a row from my data table has at least one word from the list. Can you give me some clues?
Try:
data_final[sapply(strsplit(data_final$fields_with_words, "&&&&"),
                  function(x) any(x %in% word_list)), ]
I have used word_list instead of list here since list is a built-in function in R.
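A minimal sketch with made-up data (the column contents here are assumptions, just to show the mechanics):
library(data.table)

data_final <- data.table(
  id = 1:3,
  fields_with_words = c("word1&&&&word9", "word7&&&&word8", "word2&&&&word5")
)
word_list <- c("word1", "word2", "word3")

data_final[sapply(strsplit(fields_with_words, "&&&&", fixed = TRUE),
                  function(x) any(x %in% word_list)), ]
# keeps rows 1 and 3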
Assuming you want to scan the x variable in df against the list of words lw <- c("word1","word2","word3") (a character vector of words), you can use
df[grepl(paste0("(", paste(lw, collapse = "|"), ")"), x)]
if you want regular expressions. In particular, you will also get a match when a word appears inside a longer sentence. However, with 38,000 words, I don't know if this solution is scalable.
If your x column contains only words and you want exact matching, the problem is simpler. You can do:
df[x %chin% lw]
%chin% is data.table's fast %in% operator for character vectors (%in% can also be used, but it will not be as performant). You can get better performance with merge by transforming lw into a data.table:
merge(df, data.table(x = lw), by = "x")
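A quick sketch of both exact-matching variants, again with assumed toy data:
library(data.table)

df <- data.table(x = c("word1", "word9", "word2"))
lw <- c("word1", "word2", "word3")

df[x %chin% lw]                          # rows where x is one of the words
merge(df, data.table(x = lw), by = "x")  # equivalent join-based version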

Is there a function to subset data using a qualitative requirement in a column?

I am having trouble creating a subset of a large dataframe. I need to extract all rows that match one of two cities in one of the columns; however, any subset that I create ends up empty. Given the main dataframe, I try:
New = data[data$Home.port %in% c("ARDGLASS","NEWLYN")]
However, R returns "undefined columns selected".
A comma is missing:
New = data[data$Home.port %in% c("ARDGLASS","NEWLYN"), ]
That is because you are selecting rows, not columns; if you leave out the comma, R tries to subset columns instead of rows.
I recommend using data.table:
# install.packages("data.table")
library(data.table)
data <- as.data.table(data)
new_data <- data[Home.port %in% c("ARDGLASS","NEWLYN")]
data.table is also very fast with big datasets.
The subset function will also do this task
new <- subset(data, subset = Home.port %in% c("ARDGLASS","NEWLYN"))
The base approach is functionally the same; it's just a matter of whether you use a declarative function for the task or not.
When using subset(), the first argument is the data frame you want to subset. When you check several variables you do not need to put data$ in front, which saves time and makes the code easier to read.
datasubset <- subset(data, Home.port %in% c("ARDGLASS","NEWLYN"))
You can also use multiple conditions to subset; use "&" for AND or "|" for OR, depending on what you plan to do. To match either port here you need "|", since a single Home.port value can never equal both:
datasubset <- subset(data, Home.port == "ARDGLASS" | Home.port == "NEWLYN")
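A minimal sketch with made-up data to see the difference (Home.port values and counts are assumed for illustration):
data <- data.frame(
  Home.port = c("ARDGLASS", "NEWLYN", "BRIXHAM"),
  vessels = c(12, 8, 20)
)

subset(data, Home.port %in% c("ARDGLASS", "NEWLYN"))           # two rows
subset(data, Home.port == "ARDGLASS" & Home.port == "NEWLYN")  # zero rows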

select text from multiple combinations of text within a dataframe R

I want to subset data based on a text code that is used in numerous combinations throughout one column of a df. I first checked all the variations by creating a table.
list <- as.data.frame(table(EQP$col1))
I want to search within the dataframe for the text "EFC" (even when combined with other letters) and subset those rows, so that the resultant dataframe contains only the rows mentioning "EFC".
I have looked through the question linked here, but it does not answer my question. I have reviewed the tidytext package, but that does not seem to be the solution either.
How to Extract keywords from a Data Frame in R
You can simply use grepl.
Considering your data.frame is called df and the column to subset on is col1:
df <- data.frame(
  col1 = c("eraEFC", "dfs", "asdj, aslkj", "dlja,EFC,:LJ)"),
  stringsAsFactors = FALSE
)
df[grepl("EFC", df$col1), , drop = FALSE]
Another option besides the solution mentioned by Gallarus would be:
library(stringr)
library(dplyr)
df %>% filter(str_detect(col1, "EFC"))
As described by Sam Firke in this post:
Selecting rows where a column has a string like 'hsa..' (partial string match)
