NSRegularExpression to match all strings

I have several strings like:
1 bottle of juice
1 bottle of pomato
1 package of pasta
1 butter
I want to search for a combination of all of these terms,
for example: juice, pomato, pasta
The terms should be combined with the AND operator, not the OR operator, like:
juice AND pomato AND pasta
Thanks

Related

Replacing values in one table from a corresponding key in another table by specific column

I am processing a large dataset from a questionnaire that contains coded responses in some but not all columns. I would like to replace the coded responses with actual values. The key/dictionary is stored in another database. The complicating factor is that different questions (stored as columns in original dataset) used the same code (typically numeric), but the code has different meanings depending on the column (question).
How can I replace the coded values in the original dataset with the values from the corresponding key in the dictionary table, matching by column name (which is also stored in the dictionary table)?
Below is an example of the original dataset and the dictionary table, as well as desired result.
original <- data.frame(
  name = c('Jane','Mary','John', 'Billy'),
  home = c(1,3,4,2),
  car = c('b','b','a','b'),
  shirt = c(3,2,1,1),
  shoes = c('Black','Black','Black','Brown')
)
keymap <- data.frame(
  column_name=c('home','home','home','home','car','car','shirt','shirt','shirt'),
  value_old=c('1','2','3','4','a','b','1','2','3'),
  value_new=c('Single family','Duplex','Condo','Apartment','Sedan','SUV','White','Red','Blue')
)
result <- data.frame(
  name = c('Jane','Mary','John', 'Billy'),
  home = c('Single family','Condo','Apartment','Duplex'),
  car = c('SUV','SUV','Sedan','SUV'),
  shirt = c('Blue','Red','White','White'),
  shoes = c('Black','Black','Black','Brown')
)
> original
   name home car shirt shoes
1  Jane    1   b     3 Black
2  Mary    3   b     2 Black
3  John    4   a     1 Black
4 Billy    2   b     1 Brown
> keymap
  column_name value_old     value_new
1        home         1 Single family
2        home         2        Duplex
3        home         3         Condo
4        home         4     Apartment
5         car         a         Sedan
6         car         b           SUV
7       shirt         1         White
8       shirt         2           Red
9       shirt         3          Blue
> result
   name          home   car shirt shoes
1  Jane Single family   SUV  Blue Black
2  Mary         Condo   SUV   Red Black
3  John     Apartment Sedan White Black
4 Billy        Duplex   SUV White Brown
I have tried different approaches using dplyr but have not gotten far as I do not have a robust understanding of the mutate/join syntax.
We can loop over the columns named in the 'column_name' column of 'keymap' with across(). For each column, filter keymap to the rows whose column_name matches the current column (cur_column()), drop column_name so that only columns 2 and 3 remain, convert those to a named vector with deframe, and index that vector by the column's values (as character) to get the replacements.
library(dplyr)
library(tibble)
original %>%
  mutate(across(all_of(unique(keymap$column_name)), ~
    (keymap %>%
       filter(column_name == cur_column()) %>%
       select(-column_name) %>%
       deframe)[as.character(.x)]))
-output
   name          home   car shirt shoes
1  Jane Single family   SUV  Blue Black
2  Mary         Condo   SUV   Red Black
3  John     Apartment Sedan White Black
4 Billy        Duplex   SUV White Brown
Or an approach in base R
lst1 <- split(with(keymap, setNames(value_new, value_old)), keymap$column_name)
original[names(lst1)] <- Map(\(x, y) y[as.character(x)],
original[names(lst1)], lst1)
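Please check the code below, which uses factor() to replace the values in a column with data from another data frame, in this case keymap. Note that the whole keymap is passed for every column, so a code shared between columns resolves to its first match: in the output below, shirt is mapped to the home labels instead of the shirt colours.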
library(tidyverse)
original %>% mutate(home=factor(home, keymap$value_old, keymap$value_new),
car=factor(car, keymap$value_old, keymap$value_new),
shirt=factor(shirt, keymap$value_old, keymap$value_new)
)
Created on 2023-02-04 with reprex v2.0.2
   name          home   car         shirt shoes
1  Jane Single family   SUV         Condo Black
2  Mary         Condo   SUV        Duplex Black
3  John     Apartment Sedan Single family Black
4 Billy        Duplex   SUV Single family Brown

R remove strings from a column matched in a list

I'm trying to remove specific strings from a data.frame column, that are matched with entries from a list of strings.
names_to_remove <- c("Peter", "Thomas Loco", "Sarah Miller", "Diana", "Burak El", "Stacy")
data$text
| text |
|Sarah Miller apple tree |
|Peter peach cake |
|Thomas Loco banana bread |
|Diana apple cookies |
|Burak El melon juice |
|Stacy maple tree |
The actual data.frame has ~50k rows, and the list has ~15k entries.
So far I have tried to replace the strings with data$text <- str_replace(data$text, regex(str_c("\\b",names_to_remove, "\\b", collapse = '|')), "name") but this leaves me with a column of NA values. Do you have an idea how to solve this?
If df is your dataframe:
df <- structure(list(text = c("Sarah Miller apple tree", "Peter peach cake", "Thomas Loco banana bread", "Diana apple cookies", "Burak El melon juice ", "Stacy maple tree ")), class = "data.frame", row.names = c(NA, -6L))
text
1 Sarah Miller apple tree
2 Peter peach cake
3 Thomas Loco banana bread
4 Diana apple cookies
5 Burak El melon juice
6 Stacy maple tree
We could do:
library(dplyr)
library(stringr)
pattern <- paste(names_to_remove, collapse = "|")
df %>%
mutate(text = str_remove(text, pattern))
text
1 apple tree
2 peach cake
3 banana bread
4 apple cookies
5 melon juice
6 maple tree
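A base R variant of the same idea, sketched with whole-word matching as in the question's attempt. It assumes the names contain no regex metacharacters (they are not escaped here). It also drops missing or empty names before building the pattern, since a missing entry is one possible cause of an all-NA result with str_c. The name cleaned is illustrative.

```r
names_to_remove <- c("Peter", "Thomas Loco", "Sarah Miller",
                     "Diana", "Burak El", "Stacy")
text <- c("Sarah Miller apple tree", "Peter peach cake",
          "Thomas Loco banana bread", "Diana apple cookies",
          "Burak El melon juice ", "Stacy maple tree ")

# Drop missing or empty names so they cannot end up in the pattern.
keep <- names_to_remove[!is.na(names_to_remove) & nzchar(names_to_remove)]
# Put longer names first so the alternation prefers the longest match.
keep <- keep[order(nchar(keep), decreasing = TRUE)]

# \\b on both sides restricts matches to whole words.
pattern <- paste0("\\b(?:", paste(keep, collapse = "|"), ")\\b")

# Remove the names, then trim the whitespace they leave behind.
cleaned <- trimws(gsub(pattern, "", text, perl = TRUE))
cleaned
```

With a list of around 15k names, a single alternation can become very long; if the regex engine rejects it, building the pattern from chunks of the list and applying gsub once per chunk is one possible workaround.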

imputing missing values in R dataframe

I am trying to impute missing values in my dataset by matching against values in another dataset.
This is my data:
df1 %>% head()
<V1> <V2>
1 apple NA
2 cheese NA
3 butter NA
df2 %>% head()
<V1> <V2>
1 apple jacks
2 cheese whiz
3 butter scotch
4 apple turnover
5 cheese sliders
6 butter chicken
7 apple sauce
8 cheese doodles
9 butter milk
This is what I want df1 to look like:
<V1> <V2>
1 apple jacks, turnover, sauce
2 cheese whiz, sliders, doodles
3 butter scotch, chicken, milk
This is my code:
df1$V2[is.na(df1$V2)] <- df2$V2[match(df1$V1,df2$V1)][which(is.na(df1$V2))]
This code runs, but it only pulls the first matching value for each key and ignores the rest.
Another solution just using base R
aggregate(df2$V2, list(df2$V1), c, simplify = FALSE)
Group.1 x
1 apple jacks, turnover, sauce
2 butter scotch, chicken, milk
3 cheese whiz, sliders, doodles
I don't think you even need df1 in this case; you can build the result from df2 alone (assuming the columns are named V1 and V2, as in the question's code):
df1 <- df2 %>% group_by(V1) %>% summarise(V2 = paste0(V2, collapse = ", "))
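If df1 has to be kept and only its missing entries filled, the collapsed values can be looked up by key. This is a minimal base R sketch, assuming the columns are named V1 and V2 as in the question's code; the sample data is rebuilt from the printed tables and collapsed is an illustrative name.

```r
df1 <- data.frame(V1 = c("apple", "cheese", "butter"),
                  V2 = NA_character_,
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  V1 = rep(c("apple", "cheese", "butter"), times = 3),
  V2 = c("jacks", "whiz", "scotch", "turnover", "sliders",
         "chicken", "sauce", "doodles", "milk"),
  stringsAsFactors = FALSE
)

# One comma-separated string per key, as a named character vector.
collapsed <- vapply(split(df2$V2, df2$V1), paste, character(1),
                    collapse = ", ")

# Fill only the rows of df1 that are missing, looking up by key.
missing <- is.na(df1$V2)
df1$V2[missing] <- collapsed[df1$V1[missing]]
df1
```

Keys in df1 that have no match in df2 stay NA, because indexing the named vector with an unknown name returns NA.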

Remove commas within numbers only

I have a list of product sales (and their cost) which have frustratingly been concatenated into a single string, separated by commas. I ultimately need to separate out each product into unique rows which is easy enough with stringr::str_split.
However, the cost associated with each product has comma to show thousands e.g. 1,000.00 or 38,647.89. Therefore str_split is splitting products incorrectly as it hits commas within a product's cost.
I was wondering what the best tidyverse solution would be to remove all commas which are surrounded by numbers so that 1,000.00 becomes 1000.00 and 38,647.89 becomes 38647.89. Once these commas are removed I can str_split on the commas which delimit the products and thus split each unique product into its own row.
Here is a dummy dataset:
df<-data.frame(id = c(1, 2), product = c("1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78", "1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78"))
df
id product
1 1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78
2 1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78
Expected outcome:
id product
1 1 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
2 2 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
df %>%
mutate(product = product %>% str_replace_all("([0-9]),([0-9])", "\\1\\2"))
Result
id product
1 1 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
2 2 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
> apply(df,1,function(x){gsub(",([0-9])","\\1",x[2])})
[1] "1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78"
[2] "1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78"
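One way in base R is to split on spaces and remove the first comma in each token (this handles one thousands separator per number):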
sapply(strsplit(as.character(df$product), ' '), function(i)paste(sub(',', '', i), collapse = ' '))
#[1] "1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78" "1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78"
library(tidyverse)
df$product <- str_replace_all(df$product, "(?<=\\d),(?=\\d)", "")
df
id product
1 1 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
2 2 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
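For completeness, here is a base R sketch of the whole task from the question: remove the commas between digits, then split each string into one row per product. It uses the same lookaround pattern with gsub(perl = TRUE) instead of stringr; the names parts and long are illustrative.

```r
df <- data.frame(
  id = c(1, 2),
  product = c(
    "1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78",
    "1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78"
  ),
  stringsAsFactors = FALSE
)

# Remove only commas that have a digit on both sides.
df$product <- gsub("(?<=\\d),(?=\\d)", "", df$product, perl = TRUE)

# The remaining commas delimit products, so split on them.
parts <- strsplit(df$product, ",\\s*")

# One row per product, repeating the id for each piece.
long <- data.frame(
  id = rep(df$id, lengths(parts)),
  product = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

Because the lookarounds do not consume the digits, this also handles numbers with more than one thousands separator, such as 1,234,567.00.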

Filtering for multiple strings within the same column in r

My large data set (Groceries) has a column in it containing character data (Fruits) all of which is lower case and all of which contains no punctuation.
It looks a bit like this:
# Groceries Data Frame
Id Groceries$Fruits
1 apple orange banana lemon grapefruit
2 grapes tomato passion fruit
3 strawberry orange kiwi
4 lemon orange passion fruit grapefruit lime
5 lemon orange passion fruit grapefruit lime peach
...
I'm trying to select all the rows (of which there are 3,320) from the Fruits column that contain 5 specific fruits (orange, lime, lemon, grapefruit & passion fruit). Initially I'm only interested in the rows that contain all 5 of these fruits and no additional Fruits. Thus, the only row out of these 5 that should be filtered/subsetted would be row 4. The fruits do not have to be in any particular order.
The data is actually answers to a test, so eventually I'm interested in determining who got 0/5 fruits, who got 1/5, 2/5 and so on...
I've tried 2 methods so far, both to no avail.
Firstly I tried using grep(), but no rows were stored in the resulting data frame.
# 1st attempt with grep()
Correct fruits <- Groceries[grep("orange, lemon, lime, passion fruit,
grapefruit", Groceries$Fruits), ]
And then I tried using filter(), but the selected rows don't contain just the 5 Fruits I'm seeking out, it selects all rows that contain any of the 5 fruits.
# 2nd attempt with filter
library(dplyr)
library(stringr)
CorrectFruits <- c("lemon", "orange", "passion fruit", "grapefruit",
"lime")
filter <- Groceries %>%
select(Id, Fruits) %>%
filter(str_detect(tolower(Fruits), pattern = CorrectFruits))
The result I'm after initially is a new DF containing all the columns in the Groceries table, but only the rows of those people who got all 5 of the chosen fruits correct.
Next, it would be cool to select the opposite; everyone who didn't get all 5 correct.
Finally, I'd love to be able to subset those who got a specific proportion correct. I.e. row 1 got 3 correct, row 2 only got 1 correct and row 3 only got 1 correct.
Any help would be greatly appreciated!
Here's an example of what some of the columns look like:
# Groceries
Id Age Nationality Colour question Fruits question
1 26-35 Canadian Red apple orange banana lemon grapefruit
2 26-35 US Blue grapes tomato passion fruit
3 46-55 Canadian Red strawberry orange kiwi
4 55+ US Red lemon orange passion fruit grapefruit lime
5 36-45 British Green lemon orange passion fruit grapefruit lime peach
Might need more clarification on what you intend on doing with answers that have all 5 fruits with some extra, but this should help you out. I substituted all instances of "passion fruit" with "passionfruit" to make it easier:
df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit",
"lime")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
which gives
ID  Fruits                                            Count
1   apple orange banana lemon grapefruit              3
2   grapes tomato passionfruit                        1
3   strawberry orange kiwi                            1
4   lemon orange passionfruit grapefruit lime         5
5   lemon orange passionfruit grapefruit lime peach   0
The first line does the passionfruit substitution, and then str_count counts all occurrences of the correct fruits in df$Fruits. Finally, if all 5 fruits are present but there are extras, Count is reset to 0.
Here is my answer after seeing others' genius solutions.
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple",
"apple",
"apple orange kiwi fifth",
"orange apple pineapple kiwi fifth",
"pineapple orange apple fifth kiwi"
)
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
heds1's response looks great. However, be careful with plain substring matching such as grepl, because it also matches inside compound words. For example, the word pineapple contains both pine and apple. Notice here that searching for apple also returns pineapple.
filter(df, grepl("apple", Fruits))
ID Age Nationality Color Fruits
1 1 26-35 Canadian Correct pineapple
2 2 26-35 US Incorrect apple
3 3 46-55 Canadian Incorrect apple orange kiwi fifth
4 4 55+ US Correct orange apple pineapple kiwi fifth
5 5 56-45 British Correect pineapple orange apple fifth kiwi
The answer provided by sumshyftw is awesome, and I love that I am learning something from it. But to demonstrate my point that an unrestricted string search can skew your count:
CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 2
5 5 56-45 British Correect pineapple orange apple fifth kiwi 2
Notice that it counted the pineapple as a correct answer despite that the only correct fruit is an apple. To overcome this, you want to wrap your words with \\b.
CorrectFruits <- c("\\bapple\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 0
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 1
5 5 56-45 British Correect pineapple orange apple fifth kiwi 1
R no longer counts pineapple as an apple.
But for the record, sumshyftw deserves the credit for working out the hard part in my example:
CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 4
4 4 55+ US Correct orange apple pineapple kiwi fifth 5
5 5 56-45 British Correect pineapple orange apple fifth kiwi 5
To show only those with all five fruits:
df2 <- filter(df, Count == 5)
df2
ID Age Nationality Color Fruits Count
1 4 55+ US Correct orange apple pineapple kiwi fifth 5
2 5 56-45 British Correect pineapple orange apple fifth kiwi 5
Here's one way using grepl with a target list of keywords.
df <- structure(list(v1 = structure(1:4, .Label = c("row1", "row2",
"row3", "row4"), class = "factor"), v2 = structure(c(2L, 4L,
1L, 3L), .Label = c("another invalid row", "apple banana mandarin orange pear",
"banana apple mandarin pear orange", "not a valid row"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
targets <- c("banana", "apple", "orange", "pear", "mandarin")
bool_df <- as.data.frame(sapply(targets, grepl, df$v2))
match_rows <- which(rowSums(bool_df) == 5)
df <- df[match_rows,]
You can then relax the criterion by changing the 5 in match_rows, for example to 4 to keep the rows with four fruit matches.
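The ideas from the answers above (whole-word matching, counting hits per row, and rejecting rows with extra words) can be combined without renaming "passion fruit". The following is a base R sketch on the question's five sample rows; the names hits, n_correct, has_extra and exact are illustrative.

```r
fruits <- c(
  "apple orange banana lemon grapefruit",
  "grapes tomato passion fruit",
  "strawberry orange kiwi",
  "lemon orange passion fruit grapefruit lime",
  "lemon orange passion fruit grapefruit lime peach"
)
correct <- c("lemon", "orange", "passion fruit", "grapefruit", "lime")

# Whole-word patterns, so that e.g. apple would not match pineapple.
patterns <- paste0("\\b", correct, "\\b")

# Logical matrix: one row per answer, one column per correct fruit.
hits <- sapply(patterns, grepl, x = fruits, perl = TRUE)
n_correct <- rowSums(hits)

# Strip every correct fruit; anything left over is an extra answer.
leftover <- fruits
for (p in patterns) leftover <- gsub(p, "", leftover, perl = TRUE)
has_extra <- grepl("\\S", leftover)

# All five correct fruits and nothing else.
exact <- n_correct == 5 & !has_extra
```

Here n_correct gives the score per row for the 0/5 to 5/5 breakdown, fruits[exact] selects the fully correct rows, and fruits[!exact] selects everyone else.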
