Elegant way to match and replace part of string - r

I have two data tables, dt and dt1:
dt <- data.table(s=c("AA-AA-1", "BB-BB-2", "CC-CC-3"))
s
1 AA-AA-1
2 BB-BB-2
3 CC-CC-3
dt1 <- data.table(x=c(1,2,3), name=c("AA", "BB", "CC"))
x name
1: 1 AA
2: 2 BB
3: 3 CC
I need to replace the part of each string in the s column of dt that follows the last hyphen with the name column of dt1, matching that number against the x column of dt1, so that dt becomes:
s
1: AA-AA-AA
2: BB-BB-BB
3: CC-CC-CC
I know we can do it by splitting s and matching:
split <- sapply(strsplit(as.character(dt$s), split="-"), tail, n=1)
dt1$name[match(split, dt1$x)]
Is there a faster, more elegant way to do this?

Here is a base R approach. We can create an x column in the first data table, dt, from the number appearing to the right of the final dash. Then we can merge the two data tables on x, and finally build the s values you expect by pasting name in place of the trailing number.
dt$x <- as.numeric(sub(".*-", "", dt$s))  # same type as dt1$x, so the merge keys are compatible
result <- merge(dt, dt1, by="x")
result$s <- paste0(sub("\\d+$", "", result$s), result$name)  # drop only the trailing number
result$s
[1] "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
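If the original row order of dt matters (merge sorts by the key), the same lookup can be done with match() instead of a merge. This is only a sketch in base R; replace_last_field is a name made up for this example, and it assumes the key is everything after the last hyphen:

```r
# Hypothetical helper: replace the text after the last hyphen using a lookup table.
replace_last_field <- function(s, key, value) {
  last <- sub(".*-", "", s)                        # text after the final hyphen
  repl <- value[match(last, as.character(key))]    # look the key up, order-independent
  # keep the original string where no key matches
  ifelse(is.na(repl), s, paste0(sub("[^-]*$", "", s), repl))
}

s <- c("AA-AA-1", "BB-BB-2", "CC-CC-3")
# The lookup is deliberately in a different order than s.
out <- replace_last_field(s, key = c(3, 1, 2), value = c("CC", "AA", "BB"))
out
```

Strings whose key is missing from the lookup are returned unchanged rather than turned into NA.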

I'd take the straightforward approach:
dt1[dt[, .(x = as.integer(sub('.*-', '', s)), str = sub('[^-]+$', '', s))],
    on = .(x), .(s = paste0(str, name))]
# s
#1: AA-AA-AA
#2: BB-BB-BB
#3: CC-CC-CC

base R, sprintf + sub
mapply(sprintf, sub("[^-]+$", "%s", dt$s), dt1$name)
# AA-AA-%s BB-BB-%s CC-CC-%s
# "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
I presumed that both data frames are in matching order (as they are in the example). If not, you need to match them first, for example:
mapply(sprintf, sub("-.?$", "-%s", dt$s), dt1$name[match(gsub("[^0-9]","", dt$s), dt1$x)])
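Another base R route that does the lookup and the replacement in one step, without assuming the two tables are in the same order, is regmatches<-. A minimal sketch, assuming every string ends in a key that exists in the lookup (the names lookup, m and keys are made up for this example):

```r
s <- c("AA-AA-1", "BB-BB-2", "CC-CC-3")
# Named vector built from the lookup table; deliberately not in the same order as s.
lookup <- setNames(c("CC", "AA", "BB"), c(3, 1, 2))

m <- regexpr("[^-]+$", s)                  # locate the text after the last hyphen
keys <- regmatches(s, m)                   # the matched keys as character strings
regmatches(s, m) <- unname(lookup[keys])   # swap each key for its name
s
```

Indexing the named vector by the extracted keys plays the role of match() here.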

Here is a slightly more general approach.
mapply(function(pat, repl, src){ sub(pat, repl, src) }, pat = dt1$x, repl = dt1$name, src = dt$s )
#[1] "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
If you always want to replace the text after the last hyphen, you can simplify this to:
mapply(function(repl, src){ sub("(?<=-)[^-]+$", repl, src, perl = TRUE) }, repl = dt1$name, src = dt$s )
Please note: my solution works only if dt and dt1 are ordered as in the example, that is, the first rows of both tables belong together, the second rows belong together, and so on. If this is not the case, consider combining Tim's merging approach with my solution.
So here is a rock-solid solution using some of Tim's ideas:
dt <- data.table(s=c("AA-AA-1", "BB-BB-2", "CC-CC-3"))
dt1 <- data.table(x=3:1, name=c("CC", "BB", "AA")) # note the order is not right.
dt$x <- sub(".*-", "", dt$s)
dt <- merge.default(dt, dt1, by="x")
dt$endResult <- mapply(function(repl, src){ sub("(?<=-)[^-]+$", repl, src, perl = TRUE) }, repl = dt$name, src = dt$s )

If they are sorted appropriately as in your example you can use stringr::str_replace:
library(stringr)
dt[,s := str_replace(s,as.character(dt1$x),dt1$name)]
dt
# s
# 1: AA-AA-AA
# 2: BB-BB-BB
# 3: CC-CC-CC

Related

R Replace a value with a character from another data.table

I have run into a problem in R with replacing values.
Here is the original data. There are two data.tables:
dt1 <- data.table(V1 = c(1,"A"))
dt2 <- data.table("1" = c(4,5,6), "A" = c("c","d","e"))
Now I want to replace each value in dt1 with the values from the dt2 column whose name matches it.
The desired output should be:
dt3 <- data.table(V1 = c("4,5,6", "c,d,e"))
That is, I want to replace each value in dt1 with all the values in the corresponding column of dt2. This is a simplified example; I want to apply it to a whole data.table.
I am having a lot of trouble with this, so any help is appreciated.
Here is a way from your input to desired output.
dt1[, V1 := sapply(dt2, paste, collapse = ',')[V1]]
# Test
all.equal(dt1, dt3)
[1] TRUE
PS. Are you sure that storing the values as a comma-separated string is the best approach?
We may do
dt1[, V1 := unlist(lapply(V1, function(x) toString(dt2[[x]])))]
dt1
V1
1: 4, 5, 6
2: c, d, e
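To make the point from the PS concrete, here is a small base R sketch that builds both forms: the collapsed strings and, as an alternative, a list that keeps the values as vectors. Plain data frames stand in for the data.tables, and the names keys, collapsed, as_strings and as_list are made up for this example:

```r
# Base R stand-ins for the two tables in the question.
keys <- c("1", "A")
dt2 <- data.frame("1" = c(4, 5, 6), A = c("c", "d", "e"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# One collapsed string per column of dt2, indexed by column name.
collapsed <- vapply(dt2, paste, character(1), collapse = ",")
as_strings <- unname(collapsed[keys])

# Alternative: keep the values as vectors in a list instead of pasting them.
as_list <- unname(as.list(dt2)[keys])
as_strings
```

With the list form the numbers stay numeric, so nothing has to be split and re-parsed later.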

How to subtract two comma separated columns in R?

I have a small problem that I can't seem to solve. Given two columns:
dt <- data.table(ColumnA = c("A,B,C,A,A,A", "A,B,C"), ColumnB = c("A,C,A", "C"))
I would like to "subtract" columnB from columnA, which will result in:
data.table(Result = c("B,A,A", "A,B"))
How would one achieve this without first transforming the columns into lists and then subtracting the lists? In addition, since the dataset is quite big, it cannot be done with a for loop in R.
Every item in the comma-separated string should be treated as one item and should be subtracted only once for each time it occurs in ColumnB. Hence not all A's are gone in the first row.
Another option leveraging the function vecsets::vsetdiff which doesn't remove duplicates:
library(dplyr)
library(tidyr)
library(purrr)
library(vecsets)
dt %>%
  mutate(x = strsplit(ColumnA,","),
         y = strsplit(ColumnB,",")) %>%
  mutate(z = map2(x,y,vecsets::vsetdiff))
      ColumnA ColumnB                x       y       z
1 A,B,C,A,A,A   A,C,A A, B, C, A, A, A A, C, A B, A, A
2       A,B,C       C          A, B, C       C    A, B
Note that you end up with list columns here (which I created on purpose for this to work), but the data might be easier to work with that way anyway.
sapply(1:nrow(dt), function(i){
  a = dt$ColumnA[i]
  b = unlist(strsplit(dt$ColumnB[i], ","))
  for (x in b){
    a = sub(paste0(x, ",?"), "", a)
  }
  sub(",$", "", a)
})
#[1] "B,A,A" "A,B"
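One caveat with the sub() loop above: each item is used as a regular expression, so an item such as "A" would also match inside a longer item such as "AB". If the items can be longer than one character, comparing whole items avoids that. A rough base R sketch (it does split the strings; multiset_diff and subtract_csv are names made up for this example):

```r
# Remove each element of b from a at most once (first occurrence), comparing whole items.
multiset_diff <- function(a, b) {
  for (x in b) {
    i <- match(x, a)
    if (!is.na(i)) a <- a[-i]
  }
  a
}

# Apply the multiset difference row by row to two comma-separated columns.
subtract_csv <- function(colA, colB) {
  mapply(function(a, b) paste(multiset_diff(a, b), collapse = ","),
         strsplit(colA, ",", fixed = TRUE),
         strsplit(colB, ",", fixed = TRUE),
         USE.NAMES = FALSE)
}

res <- subtract_csv(c("A,B,C,A,A,A", "A,B,C"), c("A,C,A", "C"))
res
```

This still loops in R, so it is meant to show the logic rather than to be fast on millions of rows.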
Not sure whether using string splitting violates this criterion:
How would one achieve this fact without first transforming it into a list and then try to subtract the list?
Will delete this post if OP decides that this violates OP's criteria.
Here is an option using data.table's anti-join. It takes about 7 seconds for 2 million rows:
library(data.table)
library(stringi) #for fast string processing
dt <- data.table(ColumnA = c("A,B,C,A,A,A", "A,B,C"), ColumnB = c("A,C,A", "C"))
DT_big <- dt[rep(seq(dt[, .N]), 1e6)]
f <- function(DT, coln) {
  res <- DT[, {
    s <- stri_split_fixed(get(coln), ',')
    .(rn=rep(seq_along(s), lengths(s)), S=unlist(s))
  }]
  # number repeated items within each row, so that the k-th "A" of a row in
  # ColumnB cancels the k-th "A" of the same row in ColumnA
  res[, n := rowid(rn, S)]
}
system.time({
  DTA <- f(DT_big, "ColumnA")
  DTB <- f(DT_big, "ColumnB")
  ans <- DTA[!DTB, on=.(rn, S, n)][, .(Result=paste(S, collapse=",")), .(rn)][, rn := NULL][]
})
ans
timing:
   user  system elapsed
   7.56    0.33    7.20
output:
         Result
      1:  B,A,A
      2:    A,B
      3:  B,A,A
      4:    A,B
      5:  B,A,A
     ---
1999996:    A,B
1999997:  B,A,A
1999998:    A,B
1999999:  B,A,A
2000000:    A,B
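The idea behind the anti-join is easier to see on a single row: number each repeated item (first "A", second "A", and so on) and then drop every item/occurrence pair of ColumnA that also appears in ColumnB. A base R sketch of that idea, where occurrence_id is a made-up stand-in for what rowid does within one row:

```r
# Occurrence counter: 1 for the first time a value appears, 2 for the second, ...
occurrence_id <- function(x) ave(seq_along(x), x, FUN = seq_along)

a <- c("A", "B", "C", "A", "A", "A")   # first row of ColumnA, split on ","
b <- c("A", "C", "A")                  # first row of ColumnB, split on ","

key_a <- paste(a, occurrence_id(a))    # "A 1" "B 1" "C 1" "A 2" "A 3" "A 4"
key_b <- paste(b, occurrence_id(b))    # "A 1" "C 1" "A 2"

# Anti-join: keep the items of a whose (item, occurrence) pair is not in b.
result <- a[!key_a %in% key_b]
paste(result, collapse = ",")
```

In the data.table version the row number is part of the join key as well, which is what keeps items from different rows from cancelling each other.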

Matching columns in 2 data frames when numbers don't exactly match

How do I match two different data frames when the values I am comparing are not exactly the same?
I was thinking of using merge() but I am not sure.
Table1:
ID Value.1
10001 x
18273-9 y
12824/5/6/7 z
10283/5/9 d
Table2:
ID Value.2
10001 a
18274 b
12826 c
10289 u
How do I merge Table 1 and 2 based on ID?
Which specific function of fuzzyjoin package would I use, especially with the "/" & "-" cases? How do I expand the "-" case from 18273-9 so that R will register 18273 / 18274 / 18275 / ...?
You can write a function to extract the corresponding sequences from the strings containing "/" or "-" and recombine them into a new data.frame as follows:
df1 <- data.frame(ID=c("10001","18273-9","15273-8", "15170-4", "12824/5/6/7","10283/5/9"),
                  value=c("a","c","c", "d","k", "l"), stringsAsFactors = FALSE)
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
                  value=c("o","p","q","r"), stringsAsFactors = FALSE)
doIt <- function(df){
  listAsDF <- function(l) {
    x <- stack(setNames(l, temp$value))
    names(x) <- c("ID", "value")
    return(x)
  }
  Base <- df[!grepl("\\/", df$ID) & !grepl("\\-", df$ID), ]
  #1 cases when - present
  temp <- df[grep("\\-", df$ID),]
  temp <- listAsDF(lapply(strsplit(temp$ID, "-"), function(e) seq(e[1], paste0(strtrim(e[1], nchar(e[1])-1), e[2]), 1)))
  Base <- rbind(Base, temp)
  #2 cases when / present
  temp <- df[grep("\\/", df$ID),]
  temp <- listAsDF(lapply(strsplit(temp$ID, "/"), function(a) c(a[1], paste0(strtrim(a[1], nchar(a[1])-1), a[-1]))))
  Base <- rbind(Base, temp)
  return(Base)
}
Then you can merge df2 and df1:
merge(doIt(df1), df2, by = "ID", all.x = TRUE)
Hope this helps!
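The same expand-then-merge idea can also be written with one small helper per ID, which may be easier to adapt. This is a sketch only: expand_id and splice are names made up here, and it assumes each suffix replaces the same number of trailing digits of the first ID:

```r
# Hypothetical helper: expand "18273-9" or "12824/5/6/7" into the full IDs.
expand_id <- function(id) {
  # Replace the last nchar(suffix) digits of base with suffix.
  splice <- function(base, suffix) {
    stem <- substr(rep(base, length(suffix)), 1, nchar(base) - nchar(suffix))
    paste0(stem, suffix)
  }
  if (grepl("-", id, fixed = TRUE)) {          # range: first ID up to the spliced ID
    parts <- strsplit(id, "-", fixed = TRUE)[[1]]
    as.character(seq(as.integer(parts[1]), as.integer(splice(parts[1], parts[2]))))
  } else if (grepl("/", id, fixed = TRUE)) {   # list: first ID plus each spliced ID
    parts <- strsplit(id, "/", fixed = TRUE)[[1]]
    c(parts[1], splice(parts[1], parts[-1]))
  } else {
    id
  }
}

table1 <- data.frame(ID = c("10001", "18273-9", "12824/5/6/7", "10283/5/9"),
                     Value.1 = c("x", "y", "z", "d"), stringsAsFactors = FALSE)
table2 <- data.frame(ID = c("10001", "18274", "12826", "10289"),
                     Value.2 = c("a", "b", "c", "u"), stringsAsFactors = FALSE)

# One row per expanded ID, carrying the original value along.
ids <- lapply(table1$ID, expand_id)
long <- data.frame(ID = unlist(ids),
                   Value.1 = rep(table1$Value.1, lengths(ids)),
                   stringsAsFactors = FALSE)
merged <- merge(long, table2, by = "ID")
merged
```

After the expansion an ordinary exact merge is enough, so no fuzzy matching is needed for these two cases.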
You could use the fuzzy string matching function "agrep" from base R.
df1 <- data.frame(ID=c("10001","18273-9","12824/5/6/7","10283/5/9"),
                  value=c("a","c","d","k"))
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
                  value=c("o","p","q","r"))
apply(df1, 1, function(x) agrep(x["ID"], df2$ID, max.distance = 3.5))
As you see it struggles to find the match for row 4. So it might make sense to clean your ID variable (e.g., take out the "/") before running agrep.
One option is to extract the part of ID you want to keep, and then do your merge.
You can format your ID column as follows:
library(stringr)
library(dplyr)
If you want only the digits before any symbol:
Table1 %>% mutate(ID = str_extract(ID, "[0-9]*"))
If you want to keep the first sequence of 5 digits:
Table1 %>% mutate(ID = str_extract(ID, "[0-9]{5}"))
This answers your second question, but does not use the fuzzyjoin package.
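Keep in mind that extracting the leading digits only lines up rows whose first listed ID happens to appear in Table2. A quick check with the IDs from the question, using base R's regmatches/regexpr as a stand-in for str_extract (ids1, ids2 and lead are names made up for this example):

```r
ids1 <- c("10001", "18273-9", "12824/5/6/7", "10283/5/9")   # Table1 IDs
ids2 <- c("10001", "18274", "12826", "10289")               # Table2 IDs

# Leading run of digits of each Table1 ID.
lead <- regmatches(ids1, regexpr("^[0-9]+", ids1))
lead

# Which of those would find a partner in Table2 with an exact merge?
matched <- lead %in% ids2
matched
```

Only the first ID matches here, so for this data the ranges and lists would still need to be expanded before merging.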

format data.table column values (row wise) if numeric

I am creating a summary data.table to be inserted in a knitr report using xtable. I would like to check each row value in each column if is.numeric() == TRUE and if it is, format the number, then revert it back to a character. If is.numeric() == FALSE then return the value. The actual data.table may have many columns.
Here's what I have below, with the desired output at the bottom:
library(data.table)
library(magrittr)
dt <- data.table(A = c("apples",
                       "bananas",
                       1000000.999),
                 B = c("red",
                       5000000.999,
                       0.99))
dt
a <- dt[, lapply(.SD,
                 function(x) {
                   if (is.na(is.numeric(x))) {
                     prettyNum(as.numeric(x), digits = 0, big.mark = ",")
                   } else {
                     x
                   }
                 })]
a
b <- dt[, A := ifelse(is.na(is.numeric(A)),
                      format(as.numeric(A), digits = 0, big.mark = ","),
                      A)] %>%
  .[, B := ifelse(is.na(is.numeric(B)),
                  format(as.numeric(B), digits = 0, big.mark = ","),
                  B)]
b
desired <- data.table(A = c("apples",
                            "bananas",
                            "1,000,000"),
                      B = c("red",
                            "5,000,000",
                            "1"))
desired
From my understanding lapply in the j argument of data.table syntax operates on the vector, so it can be used for functions like mean(), sum(), na.approx(), etc. and wouldn't necessarily work here. But I would like to loop over each column in the data.table without specifying each column name since there could be many columns and naming them would be cumbersome. It's kind of like I know the circle doesn't go in the square but I really want it to!
I tried the := ifelse() approach which I thought should work, but it seems to be returning the first element. On a different data.table where the column is entirely numeric, employing the same approach yields all NA.
Thanks for any help!
We can use set() together with number() from the scales package. Loop over the columns with a for loop, find the indices of the elements that consist only of digits and . ('i1'), pass them as i to set(), convert those elements to numeric, and format them with number():
library(scales)
library(data.table)
for(j in seq_along(dt)) {
  i1 <- grep("^[0-9.]+$", dt[[j]])
  set(dt, i = i1, j = j, value = number(as.numeric(dt[[j]][i1]), big.mark = ","))
}
dt
#           A         B
#1:    apples       red
#2:   bananas 5,000,001
#3: 1,000,001         1
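The reason the attempts in the question fail is that c("apples", 1000000.999) is coerced to character, so is.numeric() on a column returns a single FALSE for the whole column; the check has to be done per element. A base R sketch of the same per-element idea, without scales (format_numeric_strings is a name made up here, and a plain data frame stands in for the data.table):

```r
# Format the elements that parse as numbers; leave everything else untouched.
format_numeric_strings <- function(x) {
  num <- suppressWarnings(as.numeric(x))   # NA for elements that are not numbers
  is_num <- !is.na(num)
  x[is_num] <- formatC(round(num[is_num]), format = "d", big.mark = ",")
  x
}

df <- data.frame(A = c("apples", "bananas", "1000000.999"),
                 B = c("red", "5000000.999", "0.99"),
                 stringsAsFactors = FALSE)

# Apply to every column without naming any of them.
df[] <- lapply(df, format_numeric_strings)
df
```

Like the answer above, this rounds to the nearest integer, so 1000000.999 becomes 1,000,001 rather than the 1,000,000 shown in the desired output.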

select part of a string after certain number of special character

I have a data.table with a column
V1
a_b_c_las_poi
a_b_c_kiosk_pran
a_b_c_qwer_ok
I would like to add a new column to this data.table that drops the last part of each string, i.e. everything from the final "_" onwards.
UPDATE
So I would like the output to be
a_b_c_las
a_b_c_kiosk
a_b_c_qwer
If k is the number of fields to keep:
k <- 2 # use k <- 4 to get the output requested in the update
DT[, V1 := do.call(paste, c(read.table(text=V1, fill=TRUE, sep="_")[1:k], sep = "_"))]
fill=TRUE can be omitted if all rows have the same number of fields.
Note: DT in a reproducible form is:
library(data.table)
DF <- data.frame(V1 = c("a_b_c_las_poi", "a_b_c_kiosk_pran", "a_b_c_qwer_ok"),
stringsAsFactors = FALSE)
DT <- as.data.table(DF)
You can do this with sub and a regular expression.
sub("(.*)_.*", "\\1", DT$V1)
[1] "a_b_c_las" "a_b_c_kiosk" "a_b_c_qwer"
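Both answers can be wrapped into small helpers, one that drops the last field and one that keeps the first k fields. This is a sketch under the assumption that the separator is a single ordinary character such as "_"; drop_last_field and keep_fields are names made up for this example:

```r
# Drop everything from the last separator onwards (strings without one are unchanged).
drop_last_field <- function(x, sep = "_") {
  sub(paste0(sep, "[^", sep, "]*$"), "", x)
}

# Keep only the first k separator-delimited fields.
keep_fields <- function(x, k, sep = "_") {
  vapply(strsplit(x, sep, fixed = TRUE),
         function(p) paste(head(p, k), collapse = sep),
         character(1))
}

x <- c("a_b_c_las_poi", "a_b_c_kiosk_pran", "a_b_c_qwer_ok")
dropped <- drop_last_field(x)
first_two <- keep_fields(x, 2)
dropped
```

The two differ when rows have different numbers of fields: drop_last_field always removes exactly one field, while keep_fields always returns at most k.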

Resources