Pattern Searching in R

I have two data frames, shown below. DF1 is slightly messy: its Value column can hold multiple values from DF2 combined into a single string.
DF1
SRNo. Value
1 1ABCD2EFGH3IJKL
2 1ABCD2EFGH3IJKL/7MLPO0OKMN8MNBV
3 3ABCD4EFGH5IJKL
4 3ABCD4EFGH5IJKL/1ABCD2EFGH3IJKL
5 7MLPO0OKMN8MNBV/9IUYT7HGFD3LKJH
DF2
SRNo. Value
1 1ABCD2EFGH3IJKL
2 3ABCD4EFGH5IJKL
3 6PQRS7TUVW8XYZA
4 5FGHI9XUZX1RATP
5 9AGTY6UGFW0AAUU
6 6TEYD7RARA8MHAT
7 9IUYT7HGFD3LKJH
I want to do a lookup using the Value column in both data sets. Here is what I am trying to accomplish.
i) For rows 1 & 3 in DF1, it is a simple lookup in DF2. I expect the code to return those looked-up values.
ii) For row #2 in DF1, only the first part of the string matches a value in DF2. I expect the code to return only the first part.
iii) For row #4 in DF1, both parts of the string match values in DF2. In this case I want the first matching part of the string to be retained.
iv) For row #5, the second part of the string matches a value in DF2. I expect the code to return the second part of the string.
I have around 47,000 rows in the first data set and over 300,000 in the second, and of course there are other columns in both. I have tried this in multiple ways using str_split/str_match but could not accomplish what I want. Every suggestion is appreciated. The rest of my code is in R.
Thank you

The first step is to tidyr::separate() the Value column of DF1 at "/". Then I used dplyr::case_when() with %in% to check whether the first of the listed items appears in DF2; if it doesn't, the second item is checked. I used dplyr::mutate() to append the result to DF1 as a new column, dat.
library(dplyr)
library(tidyr)
DF1 <- data.frame("SRNo." = 1:5, Value = c("1ABCD2EFGH3IJKL","1ABCD2EFGH3IJKL/7MLPO0OKMN8MNBV","3ABCD4EFGH5IJKL","3ABCD4EFGH5IJKL/1ABCD2EFGH3IJKL","7MLPO0OKMN8MNBV/9IUYT7HGFD3LKJH"), stringsAsFactors = FALSE) %>% as_tibble()
DF2 <- data.frame("SRNo." = 1:7, Value = c("1ABCD2EFGH3IJKL","3ABCD4EFGH5IJKL","6PQRS7TUVW8XYZA","5FGHI9XUZX1RATP","9AGTY6UGFW0AAUU","6TEYD7RARA8MHAT","9IUYT7HGFD3LKJH"), stringsAsFactors = FALSE) %>% as_tibble()
DF1 %>%
# fill = "right" pads rows that have no "/" with NA instead of warning
separate(Value, c("Value1", "Value2"), sep = "/", fill = "right") %>%
mutate(dat = case_when(
Value1 %in% DF2$Value ~ Value1,
Value2 %in% DF2$Value ~ Value2,
TRUE ~ NA_character_
))
# # A tibble: 5 x 4
# SRNo. Value1 Value2 dat
# <int> <chr> <chr> <chr>
# 1 1 1ABCD2EFGH3IJKL NA 1ABCD2EFGH3IJKL
# 2 2 1ABCD2EFGH3IJKL 7MLPO0OKMN8MNBV 1ABCD2EFGH3IJKL
# 3 3 3ABCD4EFGH5IJKL NA 3ABCD4EFGH5IJKL
# 4 4 3ABCD4EFGH5IJKL 1ABCD2EFGH3IJKL 3ABCD4EFGH5IJKL
# 5 5 7MLPO0OKMN8MNBV 9IUYT7HGFD3LKJH 9IUYT7HGFD3LKJH
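
The same first-match rule can also be sketched in base R without splitting into a fixed number of columns. This is only an illustration, not the answer's method: the helper name first_match is hypothetical, and it assumes the parts are always separated by "/" and that exact equality with DF2$Value is the matching criterion.

```r
# Sample data from the question
DF1 <- data.frame(SRNo. = 1:5,
                  Value = c("1ABCD2EFGH3IJKL",
                            "1ABCD2EFGH3IJKL/7MLPO0OKMN8MNBV",
                            "3ABCD4EFGH5IJKL",
                            "3ABCD4EFGH5IJKL/1ABCD2EFGH3IJKL",
                            "7MLPO0OKMN8MNBV/9IUYT7HGFD3LKJH"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(SRNo. = 1:7,
                  Value = c("1ABCD2EFGH3IJKL", "3ABCD4EFGH5IJKL",
                            "6PQRS7TUVW8XYZA", "5FGHI9XUZX1RATP",
                            "9AGTY6UGFW0AAUU", "6TEYD7RARA8MHAT",
                            "9IUYT7HGFD3LKJH"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: for each string, return the first "/"-separated
# part that occurs in `lookup`, or NA if none of the parts occurs there.
first_match <- function(x, lookup) {
  parts <- strsplit(x, "/", fixed = TRUE)
  vapply(parts, function(p) {
    hit <- p[p %in% lookup]
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
}

DF1$dat <- first_match(DF1$Value, DF2$Value)
DF1
```

Because the helper loops over however many parts each string has, it should also cope with values that contain more than one "/", which the two-column split does not.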

Data.table solution
df1 <- read.table(text="SRNo. Value
1 1ABCD2EFGH3IJKL
2 1ABCD2EFGH3IJKL/7MLPO0OKMN8MNBV
3 3ABCD4EFGH5IJKL
4 3ABCD4EFGH5IJKL/1ABCD2EFGH3IJKL
5 7MLPO0OKMN8MNBV/9IUYT7HGFD3LKJH", header = T, stringsAsFactors = F)
df2 <- read.table( text = "SRNo. Value
1 1ABCD2EFGH3IJKL
2 3ABCD4EFGH5IJKL
3 6PQRS7TUVW8XYZA
4 5FGHI9XUZX1RATP
5 9AGTY6UGFW0AAUU
6 6TEYD7RARA8MHAT
7 9IUYT7HGFD3LKJH", header = T, stringsAsFactors = F )
library( data.table )
setDT(df1)[, c( "Value1", "Value2" ) := tstrsplit( Value, "/", fixed = TRUE)]
setDT(df2)
resultv1 <- df2[ df1, on = c( Value = "Value1"), nomatch = 0L ]
resultv2 <- df2[ df1, on = c( Value = "Value2"), nomatch = 0L ]
result <- rbindlist( list( resultv1, resultv2 ) )[!duplicated( i.SRNo.)]
Benchmarking it against the solution from @Paul shows similar runtimes (~2.5 milliseconds), but data.table sometimes surprises me on larger data sets.
If memory becomes an issue, you can do it all in one go:
rbindlist( list( setDT(df2)[ setDT(df1)[, c( "Value1", "Value2" ) := tstrsplit( Value, "/", fixed = TRUE)],
on = c( Value = "Value1"), nomatch = 0L ],
setDT(df2)[ setDT(df1)[, c( "Value1", "Value2" ) := tstrsplit( Value, "/", fixed = TRUE)],
on = c( Value = "Value2"), nomatch = 0L ] ) )[!duplicated( i.SRNo.)]

Related

Splitting strings into components

For example, I have a data table with several columns:
column A column B
key_500:station and loc 2
spectra:key_600:type 9
alpha:key_100:number 12
I want to split the rows of column A into components and create new columns, guided by the following rules:
the value between "key_" and ":" will be var1,
the next value after ":" will be var2,
the original column A should retain the part of string that is prior to ":key_". If it is empty (as in the first line), then replace "" with an "effect" word.
My expected final data table should be like this one:
column A column B var1 var2
effect 2 500 station and loc
spectra 9 600 type
alpha 12 100 number

Using tidyr's extract(), you can pull specific parts of the string out with a regex.
library(dplyr) # provides %>%
tidyr::extract(df, columnA, into = c('var1', 'var2'), 'key_(\\d+):(.*)',
convert = TRUE, remove = FALSE) %>%
dplyr::mutate(columnA = sub(':?key_.*', '', columnA),
columnA = replace(columnA, columnA == '', 'effect'))
# columnA var1 var2 columnB
#1 effect 500 station and loc 2
#2 spectra 600 type 9
#3 alpha 100 number 12
If you want to use data.table, you can break this down into steps:
library(data.table)
setDT(df)
df[, c('var1', 'var2') := .(sub('.*key_(\\d+).*', '\\1',columnA),
sub('.*key_\\d+:', '', columnA))]
df[, columnA := sub(':?key_.*', '', columnA)]
df[, columnA := replace(columnA, columnA == '', 'effect')]
data
df <- structure(list(columnA = c("key_500:station and loc",
"spectra:key_600:type", "alpha:key_100:number"),
columnB = c(2L, 9L, 12L)), class = "data.frame", row.names = c(NA, -3L))
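
For comparison, a base R sketch of the same extraction, using regexec() and regmatches() to capture all three pieces in one pass. The pattern below is an assumption built from the rules in the question (optional prefix, then key_<digits>:, then the rest), and the code assumes every row matches it; a row that does not match would need separate handling.

```r
df <- data.frame(columnA = c("key_500:station and loc",
                             "spectra:key_600:type",
                             "alpha:key_100:number"),
                 columnB = c(2L, 9L, 12L),
                 stringsAsFactors = FALSE)

# Group 1: optional prefix before ":key_", group 2: digits after "key_",
# group 3: everything after the following ":"
pattern <- "^(.*?):?key_([0-9]+):(.*)$"
m <- regmatches(df$columnA, regexec(pattern, df$columnA, perl = TRUE))

# Each list element is c(full match, group 1, group 2, group 3);
# binding them row-wise gives a 4-column character matrix.
parts <- do.call(rbind, m)

df$var1 <- as.integer(parts[, 3])
df$var2 <- parts[, 4]
# An empty prefix becomes the placeholder word "effect"
df$columnA <- ifelse(parts[, 2] == "", "effect", parts[, 2])
df
```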

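You can use separate(), which by default splits on any run of non-alphanumeric characters and puts the pieces into the columns named in into. Note that a space also counts as a separator here, so this example uses "station" rather than "station and loc".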
require(tidyr)
require(dplyr)
df=tribble(
~"column A",~"column B",
"key_500:station", 2,
"spectra:key_600:type", 9,
"alpha:key_100:number", 12)
df %>% separate("column A",into=c('column A','key','var1','var2'),fill='left') %>% select(-key) %>% select("column A","column B",var1,var2) %>%
mutate(`column A`=ifelse(is.na(`column A`),"effect",`column A`))
And this is a modified version to work with data.tables
require(tidyr)
require(data.table)
DT=data.table(
"column A"=
c("key_500:station and loc",
"spectra:key_600:type",
"alpha:key_100:number"),
"column B"=c(2,9,12))
DT=separate(sep = "[^[:alnum:] ]+",DT,"column A",into=c('column A','key','var1','var2'),fill='left')
DT$key=NULL
DT$`column A`=ifelse(is.na(DT$`column A`),"effect",DT$`column A`)
DT=DT[,c(1,4,2,3)]

Replace Dataframe Column Values with User Defined Function in R

I have a grouped set of values in a column that I am trying to replace with a single value
col1
a
a;a;b;c
c;b;a
NA
b;b;b
I want to replace each value with either Mixed or the single value present; for example, a;a;a;a becomes a
Expected Output
col1
a
Mixed
Mixed
NA
b
Code
grouping = function(x){
y = as.list(strsplit(x, ";")[[1]])
#select first element, and test if each is the same element.
z = ""
for (i in 1:length(y)){
if (as.character(y[1]) != as.character(y[i])) {
z = 'mixed'
break
} else {
z = as.character(y[1])
}
}
return(z)
}
db %>%
select(col1) %>%
mutate(
test = grouping(col1)
)
I have tried it a few different ways and either end up with it not working at all or giving the value a for everything

A base R option via defining a user function f
f <- function(x) ifelse(length(u <- unique(unlist((strsplit(x, ";"))))) > 1, "Mixed", u)
such that
> transform(df, col1 = Vectorize(f)(col1))
col1
1 a
2 Mixed
3 Mixed
4 <NA>
5 b

You can also consider this for your function and use base R:
#Function
myfun <- function(x)
{
y <- unlist(strsplit(x, ";"))
if(length(unique(y))==1)
{
z <- unique(y)
} else
{
z <- 'Mixed'
}
}
#Apply
df$New <- apply(df,1,myfun)
Output:
df
col1 New
1 a a
2 a;a;b;c Mixed
3 c;b;a Mixed
4 <NA> <NA>
5 b;b;b b
Some data used:
#Data
df <- structure(list(col1 = c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")), class = "data.frame", row.names = c(NA,
-5L))

We can extract the letters from 'col1', count the distinct elements with n_distinct, and use case_when to change the values that have more than one unique element to 'Mixed'
library(dplyr)
library(stringr)
library(purrr)
df1 %>%
mutate(col1 = case_when(map_dbl(str_extract_all(col1,
"[a-z]"), n_distinct) >1 ~ "Mixed",
is.na(col1) ~ NA_character_,
TRUE ~ substr(col1, 1, 1)))
-output
# col1
#1 a
#2 Mixed
#3 Mixed
#4 <NA>
#5 b
Or another option is to split the column by the delimiter with separate_rows, and do a group by row_number to summarise elements having more than one row (after the distinct) to be 'Mixed'
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(col1) %>%
distinct() %>%
group_by(rn) %>%
summarise(col1 = case_when(n() > 1 ~ 'Mixed', TRUE ~ first(col1)),
.groups = 'drop') %>%
select(-rn)
-output
# A tibble: 5 x 1
# col1
# <chr>
#1 a
#2 Mixed
#3 Mixed
#4 <NA>
#5 b
Or using base R with a compact option
v1 <- gsub("([a-z])\\1+", "\\1", gsub(";", "", df1$col1))
replace(v1, nchar(v1) > 1, "Mixed")
#[1] "a" "Mixed" "Mixed" NA "b"
The issue in the OP's function is that it extracts only the first list element with [[1]]:
as.list(strsplit(x, ";")[[1]])
strsplit returns a list whose length equals the number of rows in the input. By selecting only the first element, the function computes a single result from row 1, and mutate recycles that result to every row.
data
df1 <- structure(list(col1 = c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")),
class = "data.frame", row.names = c(NA,
-5L))
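
The point about strsplit can be checked directly. The sketch below first shows that the split result has one list element per input row, then applies the same uniqueness test to every element rather than only the first. The function name collapse_group is hypothetical and used only for illustration.

```r
x <- c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")

# strsplit returns one list element per input string
pieces <- strsplit(x, ";", fixed = TRUE)
length(pieces)   # same as length(x)
pieces[[1]]      # only the pieces of the first string

# Hypothetical vectorised helper: inspect every element, not just [[1]]
collapse_group <- function(x) {
  vapply(strsplit(x, ";", fixed = TRUE), function(p) {
    u <- unique(p)
    if (length(u) == 1) u else "Mixed"
  }, character(1))
}

result <- collapse_group(x)
result
```

An NA input stays NA here because strsplit returns NA for it, which has exactly one unique value.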

You can write the grouping function as :
grouping <- function(x) {
sapply(strsplit(x, ';'), function(x)
if(length(unique(x)) == 1) unique(x) else 'Mixed')
}
db$test <- grouping(db$col1)
db
# col1 test
#1 a a
#2 a;a;b;c Mixed
#3 c;b;a Mixed
#4 <NA> <NA>
#5 b;b;b b

Replacing parts of value (starts_with) in dataframe with values from a different dataframe in R

I want to replace part of each value in df1 with a value from df2. If df1$col1 starts with one of the four-digit numbers in df2$col1, replace those four digits (keeping the rest) with the corresponding df2$col2. The same applies to df1$col2. Example: in 16122567, 1612 is replaced with 5059, resulting in 50592567. I have tried different kinds of starts_with, loops, for (i in ..), mutate, etc. Anyone? (I'm new to R.)
df1 col1 col2
1 16122567 89992567
2 17236945 16126548
3 95781657 19995670
4 16126972 56972541
df2 col1 col2
1 1612 5059
2 1723 5044
3 8999 5094
4 1999 9053

Here is one way with dplyr. We can create a new column with first 4 characters of col1, left_join with df2, replace NA's with four characters of col2.x. Finally, we use substr to replace values at specific position.
library(dplyr)
df3 <- df1 %>%
mutate(col1 = substr(col1, 1, 4)) %>%
left_join(df2 %>% mutate(col1 = as.character(col1)), by = 'col1') %>%
mutate(col2.y = ifelse(is.na(col2.y), substr(col2.x, 1, 4), col2.y),
col2.x = as.character(col2.x))
substr(df3$col2.x, 1, 4) <- df3$col2.y
df3
# col1 col2.x col2.y
#1 1612 50592567 5059
#2 1723 50446548 5044
#3 9578 19995670 1999
#4 1612 50592541 5059

Here is another approach using base R. We can create a function to do the check and manipulate the text and then apply that function to any column we want to modify.
# the data
df1 <- data.frame(col1 = c(16122567, 17236945, 95781657, 16126972),
col2 = c(89992567, 16126548, 19995670, 56972541))
df2 <- data.frame(col1 = c(1612, 1723, 8999, 1999),
col2 = c(5059, 5044, 5094, 9053))
# a function to do the check and create the chimera strings
check_and_paste <- function(check1, check2, replacement) {
res <- c()
for (i in seq_along(check1)) {
four_digits <- substr(check1[i], 1, 4)
if (four_digits %in% check2) {
res[i] <- paste(replacement[which(four_digits == check2)],
substr(check1[i], 5, 8),
sep = "")
} else {
res[i] <- check1[i]
}
}
return(as.numeric(as.character(res))) # to return numbers
}
# apply to the first column
new_col1 <- check_and_paste(
check1 = df1$col1,
check2 = df2$col1,
replacement = df2$col2
)
# and the second
new_col2 <- check_and_paste(
check1 = df1$col2,
check2 = df2$col1,
replacement = df2$col2
)
# the new data frame
data.frame(new_col1, new_col2)
# new_col1 new_col2
#1 50592567 50942567
#2 50446945 50596548
#3 95781657 90535670
#4 50596972 56972541
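
The loop above can also be written without iterating element by element. The following is a sketch under the same assumptions as the question (prefixes are exactly four digits and appear at most once in df2$col1); the helper name replace_prefix is hypothetical. It uses match() to look up every prefix at once.

```r
df1 <- data.frame(col1 = c(16122567, 17236945, 95781657, 16126972),
                  col2 = c(89992567, 16126548, 19995670, 56972541))
df2 <- data.frame(col1 = c(1612, 1723, 8999, 1999),
                  col2 = c(5059, 5044, 5094, 9053))

# Hypothetical helper: swap the first four digits of x for the value in
# `to` whose partner in `from` equals those four digits.
replace_prefix <- function(x, from, to) {
  x <- as.character(x)
  idx <- match(substr(x, 1, 4), as.character(from))  # NA where no prefix matches
  hit <- !is.na(idx)
  x[hit] <- paste0(to[idx[hit]], substr(x[hit], 5, nchar(x[hit])))
  as.numeric(x)
}

new_df <- data.frame(
  col1 = replace_prefix(df1$col1, df2$col1, df2$col2),
  col2 = replace_prefix(df1$col2, df2$col1, df2$col2)
)
new_df
```

Values whose prefix is not listed in df2$col1 are returned unchanged.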

Deduplicating a data frame when the order of values may differ in R

Let's say I have a data.frame that looks like this:
df = data.frame(from=c(1, 1, 2, 1),
to=c(2, 3, 1, 4),
title=c("A", "B", "A", "A"),
stringsAsFactors=F)
df is an object that holds all of the various connections for a network graph. I also have a second data.frame, which is the simplified graph data:
df2 = data.frame(from=c(1, 1, 3),
to=c(2, 4, 1),
stringsAsFactors=F)
What I need is to pull the title values from df into df2. I can't simply dedup df because a) from and to can be in different orders, and b) title is not unique between connections. The current condition I have is:
df2$title = df$title[df2$from == df$from & df2$to == df$to]
However, this results in too few rows because the order of from and to is reversed in row 3 of df2. If I introduce an OR condition, then I get too many results because the connection between 1 and 2 will be matched twice.
My question, then, is how do I effectively "dedup" the title variable to append it to df2?
The expected outcome is this:
from to title
1 1 2 A
2 1 4 A
3 3 1 B

library(dplyr)
merge(mutate(df2, from1 = pmin(from, to), to1 = pmax(from, to)),
mutate(df, from1 = pmin(from, to), to1 = pmax(from, to)),
by = c("from1", "to1"), all.x = TRUE) %>%
select(from1, to1, title) %>% unique()
# from1 to1 title
#1 1 2 A
#3 1 3 B
#4 1 4 A
Another way is to define an edgeSort function that produces the same key for two edges whenever they connect the same two vertices, regardless of direction, and then use match to find, for each edge in df2, the first equal edge in df.
edgeSort <- function(df) apply(df, 1, function(row) paste0(sort(row[1:2]), collapse = ", "))
df2$title <- df$title[match(edgeSort(df2), edgeSort(df))]
df2
from to title
1 1 2 A
2 1 4 A
3 3 1 B
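
The two ideas above (pmin/pmax to normalise direction, match to pick the first hit) can be combined into a fully vectorised sketch that avoids the row-wise apply. The helper name edge_key and the "-" separator are illustrative choices, not part of the original answer.

```r
df <- data.frame(from = c(1, 1, 2, 1),
                 to = c(2, 3, 1, 4),
                 title = c("A", "B", "A", "A"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(from = c(1, 1, 3),
                  to = c(2, 4, 1))

# Hypothetical helper: direction-independent key, smaller vertex first
edge_key <- function(from, to) {
  paste(pmin(from, to), pmax(from, to), sep = "-")
}

# match() returns the first matching position, so an edge that appears
# twice in df (1-2 and 2-1) contributes only one title
df2$title <- df$title[match(edge_key(df2$from, df2$to),
                            edge_key(df$from, df$to))]
df2
```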

I guess you can do it in base R by 2 merge statements:
step1 <- merge(df2, df, all.x = TRUE)
step2 <- merge(df2[is.na(step1$title),], df, all.x = TRUE, by.x = c("to", "from"), by.y = c("from", "to"))
rbind(step1[!is.na(step1$title),], step2)
from to title
1 1 2 A
2 1 4 A
3 3 1 B

Aggregating data frame rows using an input vector

I have this toy data.frame:
df = data.frame(id = c("a","b","c","d"), value = c(2,3,6,5))
and I'd like to aggregate its rows according to this toy vector:
collapsed.ids = c("a,b","c","d")
where the aggregated data.frame should keep max(df$value) of its aggregated rows.
So for this toy example the output would be:
> aggregated.df
id value
1 a,b 3
2 c 6
3 d 5
I should note that my real data.frame is ~150,000 rows

I would use data.table for this.
Something like the following should work:
library(data.table)
DT <- data.table(df, key = "id") # Main data.table
Key <- data.table(ind = collapsed.ids) # your "Key" table
## We need your "Key" table in a long form
Key <- Key[, list(id = unlist(strsplit(ind, ",", fixed = TRUE))), by = ind]
setkey(Key, id) # Set the key to facilitate a merge
## Merge and aggregate in one step
DT[Key][, list(value = max(value)), by = ind]
# ind value
# 1: a,b 3
# 2: c 6
# 3: d 5

You don't need data.table, you can just use base R.
split.ids <- strsplit(collapsed.ids, ",")
# map each individual id to the collapsed id it belongs to
split.df <- data.frame(id = unlist(split.ids),
joinid = rep(collapsed.ids, sapply(split.ids, length)))
# aggregate by the collapsed id (joinid), not the original id
aggregated.df <- aggregate(value ~ joinid, data = merge(df, split.df), max)
Result:
# joinid value
# 1 a,b 3
# 2 c 6
# 3 d 5
Benchmark
df <- df[rep(1:4, 50000), ] # Make a big data.frame
system.time(...) # of the above code
# user system elapsed
# 1.700 0.154 1.947
EDIT: Apparently Ananda's code runs in 0.039, so I'm eating crow. But either is acceptable for this size.
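
A third base R variant, sketched here for illustration, skips the merge and builds a named lookup vector from each id to its collapsed group, then takes the maximum per group with tapply. It assumes every id in df appears in exactly one entry of collapsed.ids; the names lookup, group and maxima are illustrative.

```r
df <- data.frame(id = c("a", "b", "c", "d"),
                 value = c(2, 3, 6, 5),
                 stringsAsFactors = FALSE)
collapsed.ids <- c("a,b", "c", "d")

# Named vector: names are individual ids, values are their collapsed group
split.ids <- strsplit(collapsed.ids, ",", fixed = TRUE)
lookup <- setNames(rep(collapsed.ids, lengths(split.ids)),
                   unlist(split.ids))

# Group label for each row of df, kept in the order of collapsed.ids
group <- factor(lookup[df$id], levels = collapsed.ids)

# Maximum value within each group
maxima <- tapply(df$value, group, max)
aggregated.df <- data.frame(id = names(maxima),
                            value = as.vector(maxima),
                            stringsAsFactors = FALSE)
aggregated.df
```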
