R Match data tables using string matching

I have two data tables:
dt1 <- data.table(V1=c("Apple Pear Orange, AAA111", "Grapes Banana Pear .BBB222", "Orange Kiwi Melon ,CCC333.", "Apple DDD444, Pear Orange", "Kiwi Melon Orange, CCC333", "Apple Pear Orange, AAA111", "Tomato Cucumber-EEE222", "Seagull Pigeon ZZZ111" ), stringsAsFactors = F)
dt2 <- data.table(Code=c("AAA111", "AAA222", "AAA333", "AAA444", "AAA555", "AAA666", "BBB111", "BBB222", "BBB333", "BBB444", "BBB555", "BBB666", "CCC111", "CCC222", "CCC333", "CCC444", "CCC555", "CCC666", "DDD111", "DDD222", "DDD333", "DDD444", "DDD555", "DDD666", "EEE111", "EEE222", "EEE333", "EEE444", "EEE555", "EEE666"), stringsAsFactors = F)
dt2$Ref <- 1:nrow(dt2)
Each row in dt1 contains an unformatted string that includes a 'Code'. dt2 contains a list of codes that can be matched. What I am after is a way for the 'Code' part of the string in each row of dt1 to be identified and then matched with the corresponding code in dt2. If there is no matching code in dt2 then NA is returned.
Here is the type of output I am after:
dt3 <- data.table(V1=c("Apple Pear Orange, AAA111", "Grapes Banana Pear .BBB222", "Orange Kiwi Melon ,CCC333.", "Apple DDD444, Pear Orange", "Kiwi Melon Orange, CCC333", "Apple Pear Orange, AAA111", "Tomato Cucumber-EEE222", "Seagull Pigeon ZZZ111"), Code=c("AAA111", "BBB222", "CCC333", "DDD444", "CCC333", "AAA111", "EEE222", "NA"), Ref=c("1", "8", "15", "22", "15", "1", "26", "NA"), stringsAsFactors = F)
I have tried using regex, grep etc. to find a solution but have not got anywhere.

You can use regex_left_join from my fuzzyjoin package:
library(fuzzyjoin)
regex_left_join(dt1, dt2, by = c(V1 = "Code"))
#> V1 Code Ref
#> 1: Apple Pear Orange, AAA111 AAA111 1
#> 2: Grapes Banana Pear .BBB222 BBB222 8
#> 3: Orange Kiwi Melon ,CCC333. CCC333 15
#> 4: Apple DDD444, Pear Orange DDD444 22
#> 5: Kiwi Melon Orange, CCC333 CCC333 15
#> 6: Apple Pear Orange, AAA111 AAA111 1
#> 7: Tomato Cucumber-EEE222 EEE222 26
#> 8: Seagull Pigeon ZZZ111 NA NA
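If adding a package is not an option, the same lookup can be sketched in base R. This is only a sketch under one loud assumption that the question does not state: every code is exactly three capital letters followed by three digits. The names v1, codes, found and result are made up for the illustration.

```r
# Assumption: a code is three capital letters followed by three digits.
v1 <- c("Apple Pear Orange, AAA111", "Seagull Pigeon ZZZ111", "Tomato Cucumber-EEE222")
codes <- data.frame(Code = c("AAA111", "EEE222"), Ref = c(1L, 26L),
                    stringsAsFactors = FALSE)

# Locate the first code-like token in each string (regexpr gives -1 when there is none).
m <- regexpr("[A-Z]{3}[0-9]{3}", v1)
found <- rep(NA_character_, length(v1))
found[m > 0] <- regmatches(v1, m)

# Look each extracted code up in the reference table; unknown codes become NA.
idx <- match(found, codes$Code)
result <- data.frame(V1 = v1, Code = codes$Code[idx], Ref = codes$Ref[idx],
                     stringsAsFactors = FALSE)
result
```

Unlike the regex join, this extracts one candidate code per row first, so it should stay cheap on long tables, but it breaks as soon as a code does not follow the assumed shape.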

Related

R remove strings from a column matched in a list

I'm trying to remove specific strings from a data.frame column, that are matched with entries from a list of strings.
names_to_remove <- c("Peter", "Thomas Loco", "Sarah Miller", "Diana", "Burak El", "Stacy")
data$text
| text |
|Sarah Miller apple tree |
|Peter peach cake |
|Thomas Loco banana bread |
|Diana apple cookies |
|Burak El melon juice |
|Stacy maple tree |
The actual data.frame has ~50k rows, and the list has ~15k entries.
I tried to replace the strings with data$text <- str_replace(data$text, regex(str_c("\\b",names_to_remove, "\\b", collapse = '|')), "name"), but this leaves me with a column of NA values. Do you have an idea how to solve this?

If df is your dataframe:
df <- structure(list(text = c("Sarah Miller apple tree", "Peter peach cake", "Thomas Loco banana bread", "Diana apple cookies", "Burak El melon juice ", "Stacy maple tree ")), class = "data.frame", row.names = c(NA, -6L))
text
1 Sarah Miller apple tree
2 Peter peach cake
3 Thomas Loco banana bread
4 Diana apple cookies
5 Burak El melon juice
6 Stacy maple tree
We could do:
library(dplyr)
library(stringr)
pattern <- paste(names_to_remove, collapse = "|")
df %>%
mutate(text = str_remove(text, pattern))
text
1 apple tree
2 peach cake
3 banana bread
4 apple cookies
5 melon juice
6 maple tree
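A plain alternation also matches inside longer words, so a name like "Peter" would be cut out of "Peterson". A base R sketch that adds word boundaries and trims the leftover whitespace is shown here; the extra row "Peterson pear pie" and the variable names are invented for the example.

```r
names_to_remove <- c("Peter", "Thomas Loco", "Sarah Miller")
text <- c("Sarah Miller apple tree", "Peter peach cake",
          "Thomas Loco banana bread", "Peterson pear pie")

# Wrap the alternation in \\b so only whole names are removed.
pattern <- paste0("\\b(", paste(names_to_remove, collapse = "|"), ")\\b")

# Remove the names, then strip the space they leave behind.
cleaned <- trimws(gsub(pattern, "", text, perl = TRUE))
cleaned
```

With about 15k names the single pattern gets long; if that turns out to be slow, splitting the names into a few batches and applying gsub once per batch is one option to try.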

R_exclude rows with a column containing a value if multiple rows exist

I have a dataframe "test" as below. I want to exclude all the rows of that person, if this person has "apple" in the "fruit" column, using R language.
I wrote:
filter(test, name != test$name[test$fruit=="apple"])
original "test" data frame
actual result
expected result
Any help is appreciated! Thanks!

> test
name fruit
1 kevin apple
2 kevin pear
3 kevin peach
4 jack apple
5 jack pear
6 jack peach
7 jack kiwi
8 caleb grapefruit
9 caleb kiwi
10 caleb pear
11 justin pineapple
12 justin grape
13 justin watermelon
14 justin kiwi
First, we find all the 'name' values which have 'apple' as a fruit.
df=unique(test$name[test$fruit=="apple"])
> df
[1] kevin jack
Levels: caleb jack justin kevin
Now we need to remove the rows from test where name is the same as one of those in df, i.e. 'kevin' or 'jack'.
test1= test[ (!(test$name %in% df)),]
> test1
name fruit
8 caleb grapefruit
9 caleb kiwi
10 caleb pear
11 justin pineapple
12 justin grape
13 justin watermelon
14 justin kiwi
Of course, we can write this in a single line:
test2=test[(!(test$name %in% (unique(test$name[test$fruit=="apple"])))),]
> test2
name fruit
8 caleb grapefruit
9 caleb kiwi
10 caleb pear
11 justin pineapple
12 justin grape
13 justin watermelon
14 justin kiwi

You can do this in multiple ways.
In base R :
subset(test, !ave(fruit == 'apple', name, FUN = any))
# name fruit
#4 Justin pineapple
#5 Justin grape
Using dplyr
test %>% group_by(name) %>% filter(!any(fruit == 'apple'))
Or data.table
setDT(test)[, .SD[!any(fruit == 'apple')], name]
Another option in base R without grouping could be
subset(test, !name %in% unique(name[fruit == "apple"]))
data
test <- data.frame(name = c('Jack', 'Jack', 'Jack', 'Justin', 'Justin'),
fruit =c('pineapple', 'apple', 'grape', 'pineapple', 'grape'))
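The attempt in the question fails because != compares the name column element by element against the (recycled) vector of apple names, rather than testing membership. A small base R sketch of the membership test follows; the six-row data frame is a cut-down stand-in for the question's data.

```r
test <- data.frame(
  name  = c("kevin", "kevin", "jack", "jack", "caleb", "caleb"),
  fruit = c("apple", "pear", "apple", "kiwi", "kiwi", "pear"),
  stringsAsFactors = FALSE
)

# Everyone who has at least one "apple" row.
apple_names <- unique(test$name[test$fruit == "apple"])

# %in% tests membership for every row; != would compare position by position.
kept <- test[!test$name %in% apple_names, ]
kept
```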

Filtering for multiple strings within the same column in r

My large data set (Groceries) has a column in it containing character data (Fruits) all of which is lower case and all of which contains no punctuation.
It looks a bit like this:
# Groceries Data Frame
Id Groceries$Fruits
1 apple orange banana lemon grapefruit
2 grapes tomato passion fruit
3 strawberry orange kiwi
4 lemon orange passion fruit grapefruit lime
5 lemon orange passion fruit grapefruit lime peach
...
I'm trying to select all the rows (of which there are 3,320) from the Fruits column that contain 5 specific fruits (orange, lime, lemon, grapefruit & passion fruit). Initially I'm only interested in the rows that contain all 5 of these fruits and no additional Fruits. Thus, the only row out of these 5 that should be filtered/subsetted would be row 4. The fruits do not have to be in any particular order.
The data is actually answers to a test, so eventually I'm interested in determining who got 0/5 fruits, who got 1/5, 2/5 and so on...
I've tried 2 methods so far, both to no avail.
Firstly I tried using grep(), but no rows were stored in the resulting data frame.
# 1st attempt with grep()
CorrectFruits <- Groceries[grep("orange, lemon, lime, passion fruit, grapefruit", Groceries$Fruits), ]
And then I tried using filter(), but the selected rows don't contain just the 5 Fruits I'm seeking out, it selects all rows that contain any of the 5 fruits.
# 2nd attempt with filter
library(dplyr)
library(stringr)
CorrectFruits <- c("lemon", "orange", "passion fruit", "grapefruit",
"lime")
filter <- Groceries %>%
select(Id, Fruits) %>%
filter(str_detect(tolower(Fruits), pattern = CorrectFruits))
The result I'm after initially is a new DF containing all the columns in the Groceries table, but only the rows of those people who got all 5 of the chosen fruits correct.
Next, it would be cool to select the opposite; everyone who didn't get all 5 correct.
Finally, I'd love to be able to subset those who got a specific proportion correct. I.e. row 1 got 3 correct, row 2 only got 1 correct and row 3 only got 1 correct.
Any help would be greatly appreciated!
Here's an example of what some of the columns look like:
# Groceries
Id Age Nationality Colour question Fruits question
1 26-35 Canadian Red apple orange banana lemon grapefruit
2 26-35 US Blue grapes tomato passion fruit
3 46-55 Canadian Red strawberry orange kiwi
4 55+ US Red lemon orange passion fruit grapefruit lime
5 36-45 British Green lemon orange passion fruit grapefruit lime peach

Might need more clarification on what you intend on doing with answers that have all 5 fruits with some extra, but this should help you out. I substituted all instances of "passion fruit" with "passionfruit" to make it easier:
df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit",
"lime")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
which gives
ID Fruits Count
1 apple orange banana lemon grapefruit 3
2 grapes tomato passionfruit 1
3 strawberry orange kiwi 1
4 lemon orange passionfruit grapefruit lime 5
5 lemon orange passionfruit grapefruit lime peach 0
The first line does the passionfruit substitution, and then str_count counts all occurrences of correct fruits in df$Fruits. Finally, if all 5 fruits are correct but there are extras, Count resets to 0.
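An alternative that keeps the number of correct answers and the number of extras as separate values can be sketched in base R by splitting each answer into words. It assumes, as the question says, lower-case text with no punctuation, and it reuses the "passionfruit" substitution; the variable names are illustrative only.

```r
fruits <- c("apple orange banana lemon grapefruit",
            "lemon orange passion fruit grapefruit lime",
            "lemon orange passion fruit grapefruit lime peach")
correct <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")

# Join the two-word fruit, then split every answer into single words.
tokens <- strsplit(gsub("passion fruit", "passionfruit", fruits, fixed = TRUE), " +")

# How many of the five target fruits appear, and how many other words appear.
n_correct <- vapply(tokens, function(x) sum(correct %in% x), integer(1))
n_extra   <- vapply(tokens, function(x) sum(!x %in% correct), integer(1))

# Rows with all five fruits and nothing else.
all_five_only <- n_correct == 5L & n_extra == 0L
```

Because whole words are compared, "pineapple" would not be counted as "apple", and n_correct can be used directly for the 0/5 to 5/5 breakdown.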

Here is my answer after seeing others' genius solutions.
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple",
"apple",
"apple orange kiwi fifth",
"orange apple pineapple kiwi fifth",
"pineapple orange apple fifth kiwi"
)
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
heds1's response looks great. However, be careful with unanchored string matching such as grepl, because it also matches inside compound words. For example, consider the word pineapple; it contains pine and apple. Notice here that searching for apple also returns pineapple.
filter(df, grepl("apple", Fruits))
ID Age Nationality Color Fruits
1 1 26-35 Canadian Correct pineapple
2 2 26-35 US Incorrect apple
3 3 46-55 Canadian Incorrect apple orange kiwi fifth
4 4 55+ US Correct orange apple pineapple kiwi fifth
5 5 56-45 British Correect pineapple orange apple fifth kiwi
The answer provided by sumshyftw is awesome. And I love that I am learning something from sumshyftw. But to demonstrate my point that an unrestrained string search could mess up your count:
CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 2
5 5 56-45 British Correect pineapple orange apple fifth kiwi 2
Notice that it counted pineapple as a correct answer even though the only correct fruit is apple. To overcome this, wrap your words with \\b, which matches a word boundary.
CorrectFruits <- c("\\bapple\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 0
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 1
5 5 56-45 British Correect pineapple orange apple fifth kiwi 1
R no longer counts pineapple as an apple.
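The same word-boundary idea can be wrapped in a small helper so the \\b markers do not have to be typed for every word. This is a base R sketch; count_words is a made-up name, not a function from stringr or any other package.

```r
# Hypothetical helper: count whole-word matches of any of `words` in each string.
count_words <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  hits <- gregexpr(pattern, text, perl = TRUE)
  # gregexpr returns -1 when nothing matches, so count positive positions only.
  vapply(hits, function(h) sum(h > 0), integer(1))
}

counts <- count_words(c("pineapple", "apple", "orange apple pineapple kiwi fifth"),
                      "apple")
counts
```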
But for the record, sumshyftw deserves the credit for working out the hard part in my example:
CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 4
4 4 55+ US Correct orange apple pineapple kiwi fifth 5
5 5 56-45 British Correect pineapple orange apple fifth kiwi 5
To show only those with all five fruits:
df2 <- filter(df, df$Count == 5)
df2
ID Age Nationality Color Fruits Count
1 4 55+ US Correct orange apple pineapple kiwi fifth 5
2 5 56-45 British Correect pineapple orange apple fifth kiwi 5

Here's one way using grepl with a target list of keywords.
df <- structure(list(v1 = structure(1:4, .Label = c("row1", "row2",
"row3", "row4"), class = "factor"), v2 = structure(c(2L, 4L,
1L, 3L), .Label = c("another invalid row", "apple banana mandarin orange pear",
"banana apple mandarin pear orange", "not a valid row"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
targets <- c("banana", "apple", "orange", "pear", "mandarin")
bool_df <- as.data.frame(sapply(targets, grepl, df$v2))
match_rows <- which(rowSums(bool_df) == 5)
df <- df[match_rows,]
You can then change the criteria in the match_rows vector by changing the 5 to, for example 4 for four fruit matches, etc.

iterating each character in every row of one column

The example of the column is test <- c('apple #1930', 'apple #84555', 'apple A #33859', 'apple good', 'peach brand A - level 1 #8839', 'peach brand A - middle or not', 'peach brand A #2283')
I want my result table to be something as:
Name Description Number
apple NA #1930
apple NA #84555
apple A #33859
apple good NA
peach brand A level 1 #8839
peach brand A middle or not NA
peach brand A NA #2283
I've tried:
findiffs <- rle(test)
newdf <- data.frame(
firststring = test[cumsum(findiffs$length)],
secondstring = test[cumsum(findiffs$length)+1]
)
newdf <- newdf[-dim(newdf)[1],]
but it doesn't give me the output I desire.
Any help would be appreciated!

I am guessing that each column has its own delimiting character. So you might want to try something like this:
library(tidyr)
test <- data.frame(orig = c('apple #1930', 'apple #84555', 'apple A #33859', 'apple good', 'peach brand A - level 1 #8839', 'peach brand A - middle or not', 'peach brand A #2283'))
test %>% separate(orig, into= c("a", "b"), sep = "[#]") %>% separate(a, into=c("aa", "bb"), sep="[-]")
aa bb b
1 apple <NA> 1930
2 apple <NA> 84555
3 apple A <NA> 33859
4 apple good <NA> <NA>
5 peach brand A level 1 8839
6 peach brand A middle or not <NA>
7 peach brand A <NA> 2283
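The same split can be sketched in base R, keeping the "#" on the number as in the desired table. It assumes that the number, when present, is always at the end of the string and that " - " is the only separator between name and description. It does not try to separate cases like "apple good" or "apple A", because the example gives no delimiter for them.

```r
test <- c("apple #1930", "apple good",
          "peach brand A - level 1 #8839", "peach brand A - middle or not")

# Trailing "#digits" becomes the number; strings without "#" get NA.
has_num <- grepl("#", test, fixed = TRUE)
number  <- ifelse(has_num, sub("^.*(#[0-9]+) *$", "\\1", test), NA_character_)

# Drop the number, then split what is left on " - ".
rest        <- trimws(sub("#[0-9]+ *$", "", test))
has_desc    <- grepl(" - ", rest, fixed = TRUE)
name        <- trimws(ifelse(has_desc, sub(" - .*$", "", rest), rest))
description <- ifelse(has_desc, sub("^.* - ", "", rest), NA_character_)

out <- data.frame(Name = name, Description = description, Number = number,
                  stringsAsFactors = FALSE)
out
```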

variable value occuring on 2 dates R

I want to find who had an apple or an orange on at least 2 different (unique) dates. I would like to create a new column with a binary indicator for whether an individual had an orange or an apple on at least two dates (1=yes, 0=no).
The nearest I've come is this plyr code.
df1<- ddply(df, .(names, fruit), mutate, acne = ifelse(fruit=="apple" | fruit=="orange" & length(unique(dates))>=2,1,0))
This is not the solution however. anne gets apples twice but on the same date, so she should not get a 1 here. Similarly ted gets a 1, even though he only got an apple once.
This is closer, but still not correct. It gives a 1 to any fruit that has occurred twice. Need the fruit to occur twice per person on two individual dates per person
df2<- ddply(df, .(fruit), mutate, acne = ifelse(length(unique(dates))>=2, 1, 0))
##this one gives a 1 to any fruit that has occurred twice. Need the fruit to occur twice per person on two individual dates per person.
If anyone could point me in the right direction here I would be very grateful.
Thank you in advance
SAMPLE DF
names<-as.character(c("john", "john", "philip", "ted", "john", "john", "anne", "john", "mary","anne", "mary","mary","philip","mary", "su","mary", "jim", "sylvia", "mary", "ted","ted","mary", "sylvia", "jim", "ted", "john", "ted"))
dates<-as.Date(c("2010-07-01", "2010-07-13", "2010-05-12","2010-02-14","2010-06-30","2010-08-15", "2010-03-21","2010-04-04","2010-09-01", "2010-03-21", "2010-12-01", "2011-01-01", "2010-08-12", "2010-11-11", "2010-05-12", "2010-12-03", "2010-07-12", "2010-12-21", "2010-02-18", "2010-10-29", "2010-08-13", "2010-11-11", "2010-05-12", "2010-04-01", "2010-05-06", "2010-09-28", "2010-11-28" ))
fruit<-as.character(c("kiwi","apple","mango", "banana","strawberry","orange","apple","raspberry", "orange","apple","orange", "apple", "strawberry", "apple", "pineapple", "peach", "orange", "nectarine", "grape","banana", "melon", "apricot", "plum", "lychee", "mango", "watermelon", "apple" ))
df<-data.frame(names,dates,fruit)
df
Desired output
names dates fruit v1
7 anne 2010-03-21 apple 0
10 anne 2010-03-21 apple 0
17 jim 2010-07-12 orange 0
24 jim 2010-04-01 lychee 0
1 john 2010-07-01 kiwi 1
2 john 2010-07-13 apple 1
5 john 2010-06-30 strawberry 1
6 john 2010-08-15 orange 1
8 john 2010-04-04 raspberry 1
26 john 2010-09-28 watermelon 1
9 mary 2010-09-01 orange 1
11 mary 2010-12-01 orange 1
12 mary 2011-01-01 apple 1
14 mary 2010-11-11 apple 1
16 mary 2010-12-03 peach 1
19 mary 2010-02-18 grape 1
22 mary 2010-11-11 apricot 1
3 philip 2010-05-12 mango 0
13 philip 2010-08-12 strawberry 0
15 su 2010-05-12 pineapple 0
18 sylvia 2010-12-21 nectarine 0
23 sylvia 2010-05-12 plum 0
4 ted 2010-02-14 banana 0
20 ted 2010-10-29 banana 0
21 ted 2010-08-13 melon 0
25 ted 2010-05-06 mango 0
27 ted 2010-11-28 apple 0

this should probably do the trick:
v1 = ave(1:nrow(df),df$names,FUN=function(x) length(unique(df$dates[x[df$fruit[x]
%in% c("orange","apple")]]))>1)
df$v1 = v1
df = df[order(df$names),]

If I understood correctly, for the purpose of your problem, apples == oranges. So the plan is to (a) create a small data.frame where fruits are oranges or apples only, as you don't care about other fruits, (b) select only unique date/name rows, (c) aggregate by name and (d) merge back to your original data.frame to get your result:
ndf <- subset(df, fruit %in% c("apple", "orange"))
ndf <- ndf[!duplicated(ndf[, c("names", "dates")]), ]
Here you can use table, but I prefer aggregate
v <- aggregate(rep(1, nrow(ndf)), by = ndf[, "names", drop = FALSE], sum)
v$x <- ifelse(v$x > 1, 1, 0)
rv <- merge(df, v)
It is a bit longer, codewise, than other answers but clear and most certainly does the job.
You could just use aggregate without the first two steps, but if you have a huge data.frame with lots of names, aggregating for every name can prove very costly.
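A compact variant of the same plan can be sketched with tapply. It counts distinct apple/orange dates per person and writes the flag back to every row of that person, so people with no apple or orange rows get 0 instead of being dropped. The six-row data frame is a cut-down stand-in for the sample data, and hit and n_dates are illustrative names.

```r
df <- data.frame(
  names = c("anne", "anne", "john", "john", "john", "ted"),
  dates = as.Date(c("2010-03-21", "2010-03-21", "2010-07-01",
                    "2010-07-13", "2010-08-15", "2010-11-28")),
  fruit = c("apple", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Rows that count towards the indicator.
hit <- df$fruit %in% c("apple", "orange")

# Number of distinct apple/orange dates per person.
n_dates <- tapply(df$dates[hit], df$names[hit], function(d) length(unique(d)))

# 1 for every row of a person with at least two distinct dates, else 0.
df$v1 <- as.integer(df$names %in% names(n_dates)[n_dates >= 2])
df
```

Here anne gets 0 because both apples fall on the same date, ted gets 0 because he has a single apple, and all of john's rows get 1, including the kiwi row.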

I did something similar to @amit's solution using by. Rownames got mangled during do.call, but you can fix that.
result <- by(df, INDICES = df$names, FUN = function(x) {
if (length(unique(x$dates)) == 1) {
x$index <- 0
return(x)
}
ao.sum <- sum(x$fruit %in% c("apple", "orange"))
if (ao.sum < 2) x$index <- 0 else x$index <- 1
x
})
do.call("rbind", result)
names dates fruit index
anne.7 anne 2010-03-21 apple 0
anne.10 anne 2010-03-21 apple 0
jim.17 jim 2010-07-12 orange 0
jim.24 jim 2010-04-01 lychee 0
john.1 john 2010-07-01 kiwi 1
john.2 john 2010-07-13 apple 1
john.5 john 2010-06-30 strawberry 1
john.6 john 2010-08-15 orange 1
john.8 john 2010-04-04 raspberry 1
john.26 john 2010-09-28 watermelon 1
mary.9 mary 2010-09-01 orange 1
mary.11 mary 2010-12-01 orange 1
mary.12 mary 2011-01-01 apple 1
mary.14 mary 2010-11-11 apple 1
mary.16 mary 2010-12-03 peach 1
mary.19 mary 2010-02-18 grape 1
mary.22 mary 2010-11-11 apricot 1
philip.3 philip 2010-05-12 mango 0
philip.13 philip 2010-08-12 strawberry 0
su su 2010-05-12 pineapple 0
sylvia.18 sylvia 2010-12-21 nectarine 0
sylvia.23 sylvia 2010-05-12 plum 0
ted.4 ted 2010-02-14 banana 0
ted.20 ted 2010-10-29 banana 0
ted.21 ted 2010-08-13 melon 0
ted.25 ted 2010-05-06 mango 0
ted.27 ted 2010-11-28 apple 0
