fuzzy join with partial string match - r

I have two data frames, each with a column of strings that can contain literally any characters in various formats, and I would like to match them.
library(stringr)
library(fuzzyjoin)
x <- data.frame(idX=1:3, string=c("silver", "30BEDJE202AA", "30BEDJE2027"))
y <- data.frame(idY=letters[1:3], seed=c("sliver", "30BEDJE202ABC", "30BEDJE2027BL"))
x$string = as.character(x$string)
y$seed = as.character(y$seed)
x %>% fuzzy_left_join(y, by = c(string = "seed"), match_fun = str_detect)
Here is the result I get when running the above code:
  idX       string  idY seed
1   1       silver <NA> <NA>
2   2 30BEDJE202AA <NA> <NA>
3   3  30BEDJE2027 <NA> <NA>
And this is what I would like to have:
  idX       string idY          seed
1   1       silver   a        sliver
2   2 30BEDJE202AA   b 30BEDJE202ABC
3   3  30BEDJE2027   c 30BEDJE2027BL
Is there a way to get there?
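One likely reason the join returns only NA: as far as I can tell, fuzzy_left_join calls match_fun(string, seed), so str_detect tests whether each seed occurs inside string, and no seed here is a substring of its counterpart. Below is a minimal sketch of one alternative, assuming that "closest by edit distance" is an acceptable definition of a match (that criterion is an assumption, not something stated in the question). It uses base R's adist; the names d, nearest and result are made up for the example.

```r
# Sample data from the question
x <- data.frame(idX = 1:3,
                string = c("silver", "30BEDJE202AA", "30BEDJE2027"),
                stringsAsFactors = FALSE)
y <- data.frame(idY = letters[1:3],
                seed = c("sliver", "30BEDJE202ABC", "30BEDJE2027BL"),
                stringsAsFactors = FALSE)

# Levenshtein distance between every string (rows) and every seed (columns)
d <- adist(x$string, y$seed)

# Assumption: the best match for each string is the seed with the smallest distance
nearest <- apply(d, 1, which.min)

# Attach the nearest seed and its distance to each row of x
result <- cbind(x, y[nearest, ], dist = d[cbind(seq_len(nrow(x)), nearest)])
rownames(result) <- NULL
result
```

In practice a cutoff on dist would probably be needed so that strings with no reasonable counterpart stay unmatched; where to put that cutoff depends on the data.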

Related

R - Parse text string into multiple columns & extract data values

I have a large dataset in the form shown below:
ID  Scores
1   English 3, French 7, Geography 8
2   Spanish 7, Classics 4
3   Physics 5, English 5, PE 7, Art 4
I need to parse the text string from the Scores column into separate columns for each subject with the scores for each individual stored as the data values, as below:
ID  English  French  Geography  Spanish  Classics  Physics  PE  Art
1   3        7       8          -        -         -        -   -
2   -        -       -          7        4         -        -   -
3   5        -       -          -        -         5        7   4
I cannot manually predefine the columns as there are 100s in the full dataset. So far I have cleaned the data to remove inconsistent capitalisation and separated each subject-mark pairing into a distinct column as follows:
df$scores2 <- str_to_lower(df$Scores)
split <- separate(
  df,
  scores2,
  into = paste0("Subject", 1:8),
  sep = "\\,",
  remove = FALSE,
  convert = FALSE,
  extra = "warn",
  fill = "warn"
)
I have looked at multiple questions on the subject, such as Split irregular text column into multiple columns in r, but I cannot find another case where the column titles and data values are mixed in the text string. How can I generate the full set of columns required and then populate the data value?
You can first strsplit the Scores column to split on subject-score pairs (which would be in a list), then unnest the list-column into rows. Then separate the subject-score pairs into Subject and Score columns. Finally transform the data from a "long" format to a "wide" format.
Thanks @G. Grothendieck for improving my code :)
library(tidyverse)
df %>%
  separate_rows(Scores, sep = ", ") %>%
  separate(Scores, sep = " ", into = c("Subject", "Score")) %>%
  pivot_wider(names_from = "Subject", values_from = "Score")
# A tibble: 3 × 9
     ID English French Geography Spanish Classics Physics PE    Art
  <int> <chr>   <chr>  <chr>     <chr>   <chr>    <chr>   <chr> <chr>
1     1 3       7      8         NA      NA       NA      NA    NA
2     2 NA      NA     NA        7       4        NA      NA    NA
3     3 5       NA     NA        NA      NA       5       7     4
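The strsplit-based steps described in this answer (split into pairs, one row per pair, separate subject from score, then widen) can also be sketched in base R. This is only an illustration, assuming every pair ends in a single integer score; the names pairs, long and wide are made up for the example.

```r
# Sample data from the question
df <- data.frame(ID = 1:3,
                 Scores = c("English 3, French 7, Geography 8",
                            "Spanish 7, Classics 4",
                            "Physics 5, English 5, PE 7, Art 4"),
                 stringsAsFactors = FALSE)

# Step 1: split each Scores string into "Subject Score" pairs (a list, one element per ID)
pairs <- strsplit(df$Scores, ",[[:space:]]*")

# Step 2: "unnest" the list into one row per pair
long <- data.frame(ID = rep(df$ID, lengths(pairs)),
                   pair = unlist(pairs),
                   stringsAsFactors = FALSE)

# Step 3: separate subject and score (assumes the score is the trailing integer)
long$Subject <- sub("[[:space:]]+[0-9]+$", "", long$pair)
long$Score <- as.integer(sub("^.*[[:space:]]+", "", long$pair))

# Step 4: long to wide; cells with no score stay NA
subject <- factor(long$Subject, levels = unique(long$Subject))  # keep first-seen order
wide <- tapply(long$Score, list(ID = long$ID, Subject = subject), function(s) s[1])
wide
```

Because the subject is taken as everything before the trailing number, multi-word subject names would survive this split, which a plain split on " " would not handle.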
Using data.table
library(data.table)
setDT(dt)
dt <- dt[, .(class_grade = unlist(strsplit(Scores, ", "))), by = ID]  # base strsplit, so stringr is not needed
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = c("grade"), sep = "")
Results
#    ID  Art Classics English French Geography   PE Physics Spanish
# 1:  1 <NA>     <NA>       3      7         8 <NA>    <NA>    <NA>
# 2:  2 <NA>        4    <NA>   <NA>      <NA> <NA>    <NA>       7
# 3:  3    4     <NA>       5   <NA>      <NA>    7       5    <NA>
Data
dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8",
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA,
-3L), class = c("data.frame"))

fuzzy_left_join with match_fun %in%

Some data
example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5
)
lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)
library(dplyr)  # provides %>%
library(fuzzyjoin)
data_combined <- example_df %>%
  fuzzy_left_join(lookup_df, by = c("url" = "string"),
                  match_fun = `%in%`)
data_combined
                   url numbs string group
1            blog/blah     1   <NA>  <NA>
2 blog/?utm_medium=foo     2   <NA>  <NA>
3                 blah     3   <NA>  <NA>
4  subscription/apples     4   <NA>  <NA>
5         UK/something     5   <NA>  <NA>
I expected data_combined to have values for string and group wherever there is a match based on match_fun. Instead, they are all NA.
For example, the first value of string in lookup_df is 'blog'. Since this is %in% the first value of url in example_df ('blog/blah'), I expected a match, with 'blog' in both the string and group fields.
To do a partial match between the part of url before the / and the string column in lookup_df, we could extract that substring into a new column and then do a regex_left_join:
library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
  mutate(string = str_remove(url, "\\/.*")) %>%
  regex_left_join(lookup_df, by = 'string') %>%
  select(url, numbs, group)
-output
#                    url numbs group
#1            blog/blah     1  blog
#2 blog/?utm_medium=foo     2  blog
#3                 blah     3  <NA>
#4  subscription/apples     4  subs
#5         UK/something     5    UK
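The all-NA result in the question follows from what %in% does: it tests whether a whole value equals one of the values on the right-hand side, not whether one string is contained in another. A minimal base R sketch of the difference, using exact matching on the extracted prefix rather than a regex join (a simplification chosen for this illustration; whole and prefix are made-up names):

```r
# Sample data from the question
example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5,
  stringsAsFactors = FALSE
)
lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK'),
  stringsAsFactors = FALSE
)

# %in% compares whole values, so no url equals 'blog', 'subscription' or 'UK'
whole <- example_df$url %in% lookup_df$string
whole  # all FALSE

# Keep only the part of each url before the first "/"
prefix <- sub("/.*$", "", example_df$url)

# Exact lookup of that prefix; unmatched prefixes give NA
example_df$group <- lookup_df$group[match(prefix, lookup_df$string)]
example_df
```

Unlike regex_left_join, match() requires the prefix to equal the lookup string exactly, so this variant would not pick up a prefix such as 'blogs'.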

How to count multiple variables in a dataset in a pipe by incorporating the grep function?

I need to count multiple variables in a dataset in one go using a pipe.
I have used the following code:
#R
NonComp_Strat <- Minor_Behaviours %>%
  filter(Categories == "Non compliant with routine") %>%
  group_by(Strategies) %>%
  summarise(frequency = n())
However, some cells in my data frame contain multiple entries separated by a comma. For example, it treats the behaviour entries "Disruptive" and "Disruptive, Off Task" as different values.
Both entries contain the variable I am looking for, but I don't know how to wrap the grep or grepl function into the pipe to count all the individual variables. There are over 20 of them, and writing over 20 individual grep calls sounds terrible. Any help is greatly appreciated.
thanks,
Dan
You will first have to split the comma separated values and make new rows out of them. Then you can group_by as you were doing:
library(splitstackshape)
df <- data.frame(id = c(1:4), Strategies = c("Disruptive", "Disruptive, Off Task", "Off Task", "Off Task, Interview"))
df
  id           Strategies
1  1           Disruptive
2  2 Disruptive, Off Task
3  3             Off Task
4  4  Off Task, Interview
df <- cSplit(df, "Strategies", ",", "long")
df
   id Strategies
1:  1 Disruptive
2:  2 Disruptive
3:  2   Off Task
4:  3   Off Task
5:  4   Off Task
6:  4  Interview
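The same split-then-count idea can be sketched without extra packages. The block below is an illustration on the sample data only; entries, counts, categories and rows_with are made-up names. It also shows a grepl variant, since the question asks about it.

```r
# Sample data from the answer above
df <- data.frame(id = 1:4,
                 Strategies = c("Disruptive", "Disruptive, Off Task",
                                "Off Task", "Off Task, Interview"),
                 stringsAsFactors = FALSE)

# Split on commas, flatten to one entry per element, and trim stray spaces
entries <- trimws(unlist(strsplit(df$Strategies, ",")))

# Count every individual entry in one go
counts <- table(entries)
counts

# grepl variant: number of rows that mention each category.
# Note: fixed substring matching would also count a category whose name
# is contained in a longer category name.
categories <- unique(entries)
rows_with <- sapply(categories, function(k) sum(grepl(k, df$Strategies, fixed = TRUE)))
rows_with
```

The two results agree here because no category appears twice in one cell and no category name is a substring of another; with messier data the split-based count is the safer of the two.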
In one dplyr and tidyr workflow:
df %>%
  separate(Strategies, paste("Strategies", 1:5, sep = "_"), extra = "drop", sep = ",") %>%
  gather(Stacked, Strategies, Strategies_1:Strategies_5) %>%
  select(-Stacked) %>%
  na.omit() %>%
  mutate(Strategies = as.factor(trimws(Strategies))) %>%
  group_by(Strategies) %>%
  summarise(count = n())
  Strategies     count
  <fct>          <int>
1 Brief Time Out     1
2 Detention          2
3 Disruptive         2
4 Interview          1
5 Off Task           1
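More generally, we could design a split function that produces reshapeable data.
spltCol <- function(x) {
  # split each element on commas (with an optional following space)
  l <- strsplit(as.character(x), ", ?")
  # pad every piece with NA up to the length of the longest one
  l <- lapply(l, function(y) c(y, rep(NA, max(lengths(l)) - length(y))))
  return(as.data.frame(do.call(rbind, l)))
}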
Example
df1
#   id                  x          z
# 1  1 alpha, beta, gamma  0.7281856
# 2  2        alpha, beta -0.3149730
# 3  3              alpha -2.6412875
# 4  4               <NA>  0.6412990
df12 <- data.frame(append(df1[-2], spltCol(df1$x)))
#   id          z    V1   V2    V3
# 1  1  0.7281856 alpha beta gamma
# 2  2 -0.3149730 alpha beta  <NA>
# 3  3 -2.6412875 alpha <NA>  <NA>
# 4  4  0.6412990  <NA> <NA>  <NA>
reshape(df12, direction="long", varying=cbind("V1", "V2", "V3"), v.names=names(df1)[2])
#     id          z time     x
# 1.1  1  0.7281856    1 alpha
# 2.1  2 -0.3149730    1 alpha
# 3.1  3 -2.6412875    1 alpha
# 4.1  4  0.6412990    1  <NA>
# 1.2  1  0.7281856    2  beta
# 2.2  2 -0.3149730    2  beta
# 3.2  3 -2.6412875    2  <NA>
# 4.2  4  0.6412990    2  <NA>
# 1.3  1  0.7281856    3 gamma
# 2.3  2 -0.3149730    3  <NA>
# 3.3  3 -2.6412875    3  <NA>
# 4.3  4  0.6412990    3  <NA>
Data
df1 <- structure(list(id = 1:4, x = structure(c(3L, 2L, 1L, NA), .Label = c("alpha",
"alpha, beta", "alpha, beta, gamma"), class = "factor"), z = c(0.72818559355044,
-0.314973049072542, -2.64128753187138, 0.641298995312115)), class = "data.frame", row.names = c(NA,
-4L))

Erasing the value in one variable if the value in another variable does not have a match in a list in dplyr

I have a table with two fields:
dd <- data.frame(measure = c("a", "a", "b", "b", "c", "c"), class = c(1,11,2,22,3,33), stringsAsFactors = F)
dd
  measure class
1       a     1
2       a    11
3       b     2
4       b    22
5       c     3
6       c    33
Each measure has an associated class. However, not every class is allowed for every measure. The only values allowed per measure are given in a list:
ls <- list(a=c(1,10), b=c(2,20,200), c=c(3,30,90))
ls
$`a`
[1] 1 10
$b
[1] 2 20 200
$c
[1] 3 30 90
I need to erase (replace with NA) the measure wherever the class has no match in the list. I succeeded in base R:
good_match <- mapply(function(xx, yy) any(xx %in% yy), ls[dd$measure], dd$class)
dd$measure[!good_match] <- NA
dd
  measure class
1       a     1
2    <NA>    11
3       b     2
4    <NA>    22
5       c     3
6    <NA>    33
However, I would like to do it in dplyr, probably with mutate, so I can pipe it and make it fit better in my script. I've tried:
library(dplyr)
dd %>% mutate(measure = ifelse(any(class %in% ls[[measure]]), measure, NA))
Error in ls[[measure]] : recursive indexing failed at level 2
I have a feeling it fails because of some kind of vectorization problem, but I'm stuck. Do you know of another, more elegant way of achieving my goal?
We can use a join after converting the named list to a tibble/data.frame
library(tidyverse)
enframe(ls, value = 'class') %>%
  unnest(class) %>%   # name the list-column explicitly; newer tidyr warns otherwise
  right_join(dd, by = 'class') %>%
  transmute(measure = name, class)
# A tibble: 6 x 2
#  measure class
#  <chr>   <dbl>
#1 a           1
#2 <NA>       11
#3 b           2
#4 <NA>       22
#5 c           3
#6 <NA>       33
A base R option would be to use stack (instead of enframe) and merge.
NOTE: ls is the name of a base R function. It is better not to give objects the same name as existing functions.
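A minimal sketch of the base R route mentioned above, using stack to flatten the list. Instead of merge, which would reorder the rows, it looks up each (measure, class) pair directly; that substitution is a choice made for this illustration. The list is named allowed here (a hypothetical name) to avoid masking the ls function.

```r
dd <- data.frame(measure = c("a", "a", "b", "b", "c", "c"),
                 class = c(1, 11, 2, 22, 3, 33),
                 stringsAsFactors = FALSE)
allowed <- list(a = c(1, 10), b = c(2, 20, 200), c = c(3, 30, 90))

# stack() turns the named list into two columns: values (class) and ind (measure)
lookup <- stack(allowed)
names(lookup) <- c("class", "measure")
lookup$measure <- as.character(lookup$measure)

# A row is valid only if its (measure, class) pair occurs in the lookup table
ok <- paste(dd$measure, dd$class, sep = "|") %in%
  paste(lookup$measure, lookup$class, sep = "|")

# Erase the measure where the pair is not allowed
dd$measure[!ok] <- NA
dd
```

Because the check is on the pair rather than on class alone, a row such as measure "a" with class 2 would become NA instead of being relabelled as "b".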

Combine values in two columns together based on specific conditions in R

I have data that looks like the following:
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
                  Barn_Yard = c("A","A",NA,"A",NA,"B"),
                  stringsAsFactors=FALSE)
print(moo)
Farm Barn_Yard
A    A
B    A
<NA> <NA>
<NA> A
A    <NA>
B    B
I am attempting to combine the columns into one variable with these rules: if both columns hold the same value, the result is that value; if both hold data but differ, the result is the value in the Farm column; if both are <NA>, the result is <NA>; and if only one column has a value, the result is that value. Thus, in this instance the result would be:
oink <- data.frame(Animal_House = c("A","B",NA,"A","A","B"),
                   stringsAsFactors = FALSE)
print(oink)
Animal_House
A
B
<NA>
A
A
B
I have tried the unite function from tidyr but it doesn't give me exactly what I want. Any thoughts? Thanks!
dplyr::coalesce does exactly that, substituting any NA values in the first vector with the value from the second:
library(dplyr)
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
                  Barn_Yard = c("A","A",NA,"A",NA,"B"),
                  stringsAsFactors = FALSE)
oink <- moo %>% mutate(Animal_House = coalesce(Farm, Barn_Yard))
oink
#>   Farm Barn_Yard Animal_House
#> 1    A         A            A
#> 2    B         A            B
#> 3 <NA>      <NA>         <NA>
#> 4 <NA>         A            A
#> 5    A      <NA>            A
#> 6    B         B            B
If you want to discard the original columns, use transmute instead of mutate.
A less succinct option is to use a couple of nested ifelse() calls, which could be useful if you wish to introduce another condition or column into the mix.
moo <- data.frame(Farm = c("A","B",NA,NA,"A","B"),
                  Barn_Yard = c("A","A",NA,"A",NA,"B"),
                  stringsAsFactors = FALSE)
# NA if both are missing, Barn_Yard if only Farm is missing, otherwise Farm
moo$Animal_House <- with(moo, ifelse(is.na(Farm) & is.na(Barn_Yard), NA,
                              ifelse(!is.na(Barn_Yard) & is.na(Farm), Barn_Yard,
                                     Farm)))
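If more columns do enter the mix, the nested ifelse() calls can be folded into a small helper. The sketch below is an assumption-laden illustration: coalesce_cols is a made-up helper name and Pasture is a hypothetical third column, neither of which appears in the question. It takes the first non-NA value across the columns, in the order given.

```r
moo <- data.frame(Farm = c("A", "B", NA, NA, "A", "B"),
                  Barn_Yard = c("A", "A", NA, "A", NA, "B"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: first non-NA value across any number of vectors
coalesce_cols <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

# Two columns: same rule as the answers above (Farm wins, Barn_Yard fills gaps)
moo$Animal_House <- coalesce_cols(moo$Farm, moo$Barn_Yard)

# Hypothetical third column used as a last fallback
moo$Pasture <- c(NA, NA, "C", NA, NA, NA)
moo$Animal_House3 <- coalesce_cols(moo$Farm, moo$Barn_Yard, moo$Pasture)
moo
```

The order of the arguments sets the priority, so putting Barn_Yard first would make it win whenever both columns hold a value.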
