Match on substring and other variables - r

I am trying to merge two datasets on key values and string patterns.
Basically, I would like a function that counts the number of substring matches, conditional on other key variables matching across the two datasets.
Across two datasets, base and merge, when the variables k1 and k2 match and merge$str2 is contained in base$str, I would like to count one match. Specifically, I would like whole words to match whole words only (e.g. "BB" is not a match for "BBB", as if the former were written as "\bBB\b"). I have multi-million-row datasets to merge, so efficiency is paramount.
I have a possible solution, but I am not sure it is the best one for very large datasets.
base <- data.frame(k1=rep(1,4),
k2=c(rep(1,3),2),
str=c("AA BBB","BB CCC","CCC","CCC"))
merge <- data.frame(k1=rep(1,2),
k2=rep(1,2),
str2=c("BB","CCC"))
library(stringr)
library(dplyr)
library(tidyr)
base %>%
  # join explicitly on the two key columns
  left_join(merge, by = c("k1", "k2")) %>%
  mutate(match = stringi::stri_detect_regex(str, paste0("\\b", str2, "\\b")),
         match = replace_na(as.integer(match), 0L)) %>%
  group_by(k1, k2, str) %>%
  summarise(matches = sum(match), .groups = "drop")
desired_result <- data.frame(k1=rep(1,4),
k2=c(rep(1,3),2),
str=c("AA BBB","BB CCC","CCC","CCC"),
match=c(0,2,1,0))

Whenever efficiency is an issue, data.table is worth considering.
Maybe this could help you:
library(data.table)
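# all.x = TRUE keeps every row of base (a left join); rows that exist only
# in merge are not needed for the count
m <- merge(as.data.table(base),
           as.data.table(merge),
           by = c("k1", "k2"),
           all.x = TRUE,
           allow.cartesian = TRUE)
# grepl() takes a single pattern, so evaluate it row by row
m[, i := grepl(paste0("\\b", str2, "\\b"), str), by = 1:nrow(m)]
m[, .(match = sum(i, na.rm = TRUE)), by = c("k1", "k2", "str")]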
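Since only whole words should match, one alternative to a per-row regex is to split str into words and join on exact equality. The following is a minimal base R sketch of that idea, not the answerer's method. It assumes that words are separated by spaces and that every str2 is a single word; the names merge_df, row_id, long and hits are introduced here for illustration only.

```r
# Toy data from the question (the second table is called merge_df here)
base <- data.frame(k1 = rep(1, 4), k2 = c(rep(1, 3), 2),
                   str = c("AA BBB", "BB CCC", "CCC", "CCC"),
                   stringsAsFactors = FALSE)
merge_df <- data.frame(k1 = rep(1, 2), k2 = rep(1, 2),
                       str2 = c("BB", "CCC"), stringsAsFactors = FALSE)

# One row per (base row, word); row_id remembers the originating base row
base$row_id <- seq_len(nrow(base))
words <- strsplit(base$str, " +")
n_words <- lengths(words)
long <- data.frame(row_id = rep(base$row_id, n_words),
                   k1 = rep(base$k1, n_words),
                   k2 = rep(base$k2, n_words),
                   word = unlist(words),
                   stringsAsFactors = FALSE)
long <- unique(long)  # a word repeated within one str counts once

# Exact equi-join on the keys plus the word itself
hits <- merge(long, unique(merge_df),
              by.x = c("k1", "k2", "word"),
              by.y = c("k1", "k2", "str2"))

# Number of matched words per base row (0 where nothing matched)
base$match <- tabulate(hits$row_id, nbins = nrow(base))
base[, c("k1", "k2", "str", "match")]
```

The same tokenise-then-join idea should carry over to data.table or dplyr; whether it is faster than the regex version on the real data would need to be benchmarked.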

Related

Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles, each with 13 columns and 8 rows. I want to detect outliers, defined as values more than 2 standard deviations from the mean, by comparing each value in column "2" to the mean of all values of column "2" in the same row across the 60 tibbles.
I know that I am on the wrong track with these lines, as I am not comparing the individual values:
lapply(list, function(x){
if(x$"2">(mean(x$"2")) + (2*sd(x$"2"))||x$"2"<(mean(x$"2")) - (2*sd(x$"2"))) {}
})
I was also hoping to replace every value identified as an outlier with the corresponding mean, calculated from the 60 values in the same position as the outlier, while keeping everything else, but I am quite unsure how to do that.
Thank you!
You haven't added an example of your code, so I've made a quick and simple example to demonstrate my answer. I think the logic becomes much more straightforward if you first combine the list of tibbles into a single tibble. This allows you to do everything you want in a simple dplyr pipe, ultimately identifying outliers by 1s in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
colB = seq(0.1,2.1,0.1),
id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
colB = seq(21,41,1),
id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
group_by(id) %>%
mutate(meanA = mean(colA),
sdA = sd(colA),
lowThresh = meanA - 2*sdA,
uppThresh = meanA + 2*sdA,
outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
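The question also asks how to replace each outlier with the mean of the values in the same position. Below is a minimal base R sketch of that step on made-up data; the data frame long and the column colA_clean are hypothetical names, and ten tables are simulated because with only two values per position nothing can lie more than 2 standard deviations from the mean.

```r
# Long format: one row per (table, position); position 1 contains one extreme value
long <- data.frame(tbl  = rep(1:10, times = 2),
                   id   = rep(1:2, each = 10),
                   colA = c(1:9, 100, rep(c(5, 6), 5)))

# Per-position mean and standard deviation, aligned with the rows of 'long'
m <- ave(long$colA, long$id, FUN = mean)
s <- ave(long$colA, long$id, FUN = sd)

# Flag values more than 2 sd away from the mean of their position
long$outlier <- abs(long$colA - m) > 2 * s

# Replace flagged values with the position mean, keep everything else
long$colA_clean <- ifelse(long$outlier, m, long$colA)
long[long$outlier, ]
```

In the dplyr pipe above, the equivalent step would presumably be one more mutate along the lines of colA = ifelse(outlier == 1, meanA, colA). Note that the mean used here still includes the outlier itself, as described in the question.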

fast replacement of data.table values by labels stored in another data.table

It is related to this question and this other one, although on a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See the reproducible example below:
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, despite all variables being stored as character, some are true numeric variables, like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of every variable in the first data.table that has a matching name in the dictionary table with the corresponding labels.
I have actually achieved it using a loop like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
if(length(which(repex_DT[[x]]=="0"))==0){
values_vector <- labels_DT$label_id
labels_vector <- labels_DT[[x]]
}else{
values_vector <- c("0",labels_DT$label_id)
labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest, if one exists), since I have thousands of variables to label for tens of thousands of records.
Any performance improvement would be more than welcome. I battled with stringi but could not get the function to run without errors unless I used hard-coded variable names. See the example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
The columns of your second data.table are just lookup vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
Edit:
You can add NA in the first position of the columns of labels_DT (similar to what you did for other missing values) or, better yet, keep the labels in a list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
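Notice that this way it is necessary to convert to factor first, because the values in repex_DT are not necessarily the sequence 1, 2, 3... Indexing by a factor uses its integer codes, so this relies on every level appearing in the data: a column such as age_group, where only "2" occurs, would be mapped to the first label.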
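A lookup that does not depend on the codes being 1, 2, 3... or on every level being present can be built with match() against explicitly stored codes. This is a minimal base R sketch on made-up data, not the answerer's code; the long-format dictionary dict with columns variable, code and label is a hypothetical layout assumed for illustration.

```r
# Hypothetical long-format dictionary: one row per (variable, code)
dict <- data.frame(
  variable = c("status", "status", "status", "children", "children"),
  code     = c("1", "2", "9", "0", "1"),
  label    = c("Employed", "Unemployed", "Do not want to say", "0", "1"),
  stringsAsFactors = FALSE
)

# Small survey table; age has no entry in the dictionary
survey <- data.frame(status   = c("1", NA, "9", "2"),
                     children = c("0", "1", NA, "0"),
                     age      = c("18", "40", "50", "22"),
                     stringsAsFactors = FALSE)

# Replace codes with labels, column by column; unmatched codes and NA become NA
for (v in intersect(names(survey), unique(dict$variable))) {
  d <- dict[dict$variable == v, ]
  survey[[v]] <- d$label[match(survey[[v]], d$code)]
}
survey
```

Because the codes are stored explicitly, a "0" code needs no special case. For thousands of columns, the same match() call could be placed inside the mapply/:= pattern shown above.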
A very computationally efficient way would be to melt your tables first, match them, and cast again:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by @det, it is not possible to handle variables with a starting '0' label in the same loop as the other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
# Update joins via matching IDs (credit to @det for the mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]

R: Partial (or full) match in a data frame or list of lists

I have a data frame with roughly 20 000 rows and 215 columns and need to find out in which columns certain keywords occur (if they occur at all).
There are lots of suggestions for partial matches in a specified column, for example
Selecting data frame rows based on partial string match in a column
Alas, none of these functions seems to allow searching ALL columns. One option is of course to write several nested loops.
However, I wonder whether there is a much more efficient way, i.e. an existing function, to search
a) all columns of a data frame (or: all lists within a list), and
b) possibly not only for one phrase, but for a list of keywords?
For example
# some data
Species <- c("Acanthurus dussumieri", "Callionymus maculatus", "Eviota prasina", "Gymnogobius urotaenia", "Kyphosus bigibbus")
Column1 <- c(60.1, 106, 78.6, 21.5, 71)
ColumnEgg <- c(11.2, 14.5, 12, 8, NA)
Add_Info <- c("Spawns when water temperatures reach above 15°C.", NA, "females deposit eggs of 1.5 mm diameter on plants. Larvae hatch after 3-13 days.", NA, "55 cm TL newborn weighs 380 g")
df <- data.frame(Species, Column1, ColumnEgg, Add_Info)
df
Now it is easy to search, if one knows in which column to look for a pattern, e.g.
library(stringr)
library(dplyr)
df%>%
filter(str_detect(Species,"Aaptosyax"))
However: how to search all columns for a phrase or a list of keywords, like
df%>%
filter(str_detect(df[1:4],"Aaptosyax"))
or
keywords <- c("Aaptosyax", "egg")
df%>%
filter(str_detect(df[1:4],keywords))
Thanks a lot for any help!
A base R option using subset + grepl + rowSums
subset(
df,
rowSums(sapply(df, grepl, pattern = "Aaptosyax")) > 0
)
grep() also accepts a data frame directly:
grep(pattern = your_pattern, x = your_dataframe)
So if you search for the word "planet" and your data frame is named my_df:
grep(pattern = "planet", x = my_df)
For a list of keywords, separate them with a pipe '|' in the pattern:
grep(pattern = "planet|egg", x = my_df)
This returns the indices of all columns in which the pattern matched (not the rows).
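To make the difference between column hits and row hits concrete, here is a small base R sketch on a made-up three-row data frame modelled on the question's data; the names pattern, cols and row_hit are introduced here for illustration.

```r
# Shortened, made-up data in the shape of the question's data frame
df <- data.frame(
  Species  = c("Acanthurus dussumieri", "Eviota prasina", "Kyphosus bigibbus"),
  Column1  = c(60.1, 78.6, 71),
  Add_Info = c("Spawns in warm water", "females deposit eggs on plants", NA),
  stringsAsFactors = FALSE
)

keywords <- c("Aaptosyax", "egg")
pattern  <- paste(keywords, collapse = "|")

# grep() on a data frame: indices of the COLUMNS containing a match
cols <- grep(pattern, df)
names(df)[cols]

# grepl() per column, combined with OR: one logical per ROW
row_hit <- Reduce(`|`, lapply(df, grepl, pattern = pattern))
df[row_hit, ]
```

Numeric columns are coerced to character by grepl(), and NA entries simply count as no match.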
Use if_any -
library(dplyr)
library(stringr)
df %>% filter(if_any(1:4, str_detect, "Aaptosyax"))
For multiple keywords, collapse them into one string.
keywords <- c("Aaptosyax", "egg")
df %>% filter(if_any(1:4, str_detect, paste0(keywords, collapse = '|')))
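Sticking with dplyr, you can use if_any to filter across several columns. Collapse the keywords into a single pattern first; otherwise the vector of patterns would be matched element-wise against the rows instead of each keyword against every row.
df %>%
  filter(if_any(everything(), ~ str_detect(as.character(.x), paste(keywords, collapse = "|"))))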
The colwise vignette is a good guide https://dplyr.tidyverse.org/articles/colwise.html

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i, ]
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You will probably need to modify this so that it uses your exact column names and comparison condition (comb2 and comb4 are placeholders).
Key ideas:
Add a filler column so that we get a complete cross-join.
Mutate a new indicator column for whether the two-letter pair is contained in the four-letter combination.
Sum the indicators by two-letter pair.
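As a concrete variant, the per-group frequency table can also be obtained without a cross-join, by enumerating the six pairs contained in each four-letter combination and tabulating them. This is a base R sketch of that swapped-in technique, not the answerer's code. It uses combn() instead of gtools::combinations(), which is assumed here to produce the same sorted combinations, and the names comb4, idx, m4 and pairs are introduced for illustration.

```r
v1 <- LETTERS[1:10]

# All 210 four-letter combinations, binned into 15 groups of 14
comb4 <- as.data.frame(t(combn(v1, 4)), stringsAsFactors = FALSE)
comb4$group <- rep(1:15, each = 14)

# Column positions of the 6 pairs inside a four-letter combination
idx <- combn(4, 2)
m4  <- as.matrix(comb4[, 1:4])

# One row per (four-letter combination, contained pair)
pairs <- data.frame(
  group = rep(comb4$group, times = ncol(idx)),
  pair  = as.vector(apply(idx, 2, function(k) paste(m4[, k[1]], m4[, k[2]]))),
  stringsAsFactors = FALSE
)

# Frequency of every pair in every group
freq <- table(pair = pairs$pair, group = pairs$group)
freq["A B", ]
```

Pairs that never occur in a group show up as 0 in the table. This avoids building the 45 x 210 cross-join, at the cost of assuming that the letters within each combination are in a consistent order.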

ifelse/case_when with seemingly tricky strings in a character vector

I have a mortality dataframe with a character vector (rac) that contains varying strings per row. These strings flag contributing causes of death. Sometimes these strings have an extra whitespace between them (see id = 4, 5, 8). Sometimes they have exactly 3 characters and at other times they have 4 characters. What I am trying to do is sweep through by row and create a new column that flags whether a particular cause of death is seen in rac or not. Here are the data.
tdf <- structure(list(id = 1:10, rac = c("I250", "K922 R628",
"C259 T149 X599", "K729 C80 J80 N288", "X72 S019", "C189",
"C259 A419 K746 N390", "C349 C787 C793 C795 F179 I10 J449",
"C349 J449 R628", "F03 N189 R628")), row.names = c(NA, -10L),
class = "data.frame")
Take id = 8, where I can easily create a flag called cause_C that notes when C793 or C795 is seen, with something like this snippet.
causex <- c("\\bC793|\\bC795")
tdf %>%
mutate(
cause_C = case_when(
str_detect(rac, causex) ~ 1,
TRUE ~ 0)
) -> tdf
It seems to work, but I would also like to catch instances where the vector shows only 3 characters, say C79; when this happens, cause_C should equal 1. This would also be a more efficient way to create the flags, because I would not have to spell out all possible versions of the code (C793, C794, C79, and so on), and because I have some 16 likely causes of death to go through and flag. But if I try the following, id = 8 ends up as 0.
tdf %>%
mutate(
cause_C = case_when(
str_sub(rac, 1, 3) == "C79" ~ 1,
TRUE ~ 0)
) -> tdf
There is something I am missing with the ifelse()/case_when() solution, and if anyone spots my mistake and the fix, I would be very appreciative! Base R, data.table, and dplyr solutions are all welcome, because I would be happy to see the speed comparisons too, given that the dataframe takes up more than 1.5 GB.
Thank you!
Ani
If you want to use data.table, would you consider splitting up the rows by diagnostic code, then use grepl to match to your vector of desired diagnoses?
library(data.table)
causex <- c("C793", "C795")
search_causex <- paste(causex, collapse = "|")
setDT(tdf, key = "rac")
tdf[, list(rac = unlist(strsplit(rac, " "))), by = id][
, result := grepl(search_causex, rac)][
result == TRUE]
If you want to search by fewer characters, you could use this as the search pattern. After splitting, each row holds a single code, so anchoring at the start matches C79 as well as C793, C795, and so on:
search_causex <- "^C79"
A similar tidyverse approach could be:
library(tidyverse)
tdf %>%
  separate_rows(rac, sep = "\\s+") %>%
  filter(grepl(search_causex, rac))
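As for why the str_sub() attempt in the question returns 0 for id = 8: it compares only the first three characters of the whole string, so a C79 code that appears later in the string is never seen. Below is a small base R sketch of the difference, using a word-boundary pattern on the unsplit strings; the vector rac here is made-up test input, and the last two entries are hypothetical.

```r
rac <- c("I250",
         "K922 R628",
         "C349 C787 C793 C795 F179 I10 J449",
         "C79",      # hypothetical three-character code
         "XC793")    # hypothetical code that merely contains "C79"

# Checks only the first three characters of each full string
first3 <- substr(rac, 1, 3) == "C79"

# Matches any code that STARTS with C79, wherever it sits in the string
cause_C <- as.integer(grepl("\\bC79", rac, perl = TRUE))

data.frame(rac, first3, cause_C)
```

The same pattern should work inside str_detect() in the case_when() snippet from the question, which avoids reshaping a 1.5 GB data frame into one row per code.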
