How to subtract two comma separated columns in R? - r

I have a small problem that I can't seem to solve. Given two columns:
dt <- data.table(ColumnA = c("A,B,C,A,A,A", "A,B,C"), ColumnB = c("A,C,A", "C"))
I would like to "subtract" columnB from columnA, which will result in:
data.table(Result = c("B,A,A", "A,B"))
How would one achieve this without first transforming the strings into lists and then subtracting the lists? In addition, since the dataset is quite big, it cannot be done with a for loop in R.
Every item in the comma-separated string should be treated as one item and should be subtracted only once if it occurs once in ColumnB. Hence not all A's are gone in the first row.

Another option leveraging the function vecsets::vsetdiff which doesn't remove duplicates:
library(dplyr)
library(tidyr)
library(purrr)
library(vecsets)
dt %>%
mutate(x = strsplit(ColumnA,","),
y = strsplit(ColumnB,",")) %>%
mutate(z = map2(x,y,vecsets::vsetdiff))
ColumnA ColumnB x y z
1 A,B,C,A,A,A A,C,A A, B, C, A, A, A A, C, A B, A, A
2 A,B,C C A, B, C C A, B
Note that you end up with list columns here (which I created on purpose for this to work), but the data might be easier to work with that way anyway.
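If vecsets is not available, the same duplicate-preserving difference can be sketched in base R. This is only a sketch, not a description of how vsetdiff works internally. The helper name multiset_diff is made up for illustration, and it assumes tokens are separated by plain commas with no surrounding whitespace. Each token is tagged with its running occurrence count, so the second "A" is distinct from the first and is removed only if ColumnB also contains a second "A".

```r
# Hypothetical helper (not part of any package): subtract the tokens of `b`
# from the tokens of `a`, removing each token once per occurrence in `b`.
multiset_diff <- function(a, b) {
  x <- strsplit(a, ",", fixed = TRUE)[[1]]
  y <- strsplit(b, ",", fixed = TRUE)[[1]]
  # Tag every token with its occurrence number, e.g. "A 1", "A 2", ...
  kx <- paste(x, ave(seq_along(x), x, FUN = seq_along))
  ky <- paste(y, ave(seq_along(y), y, FUN = seq_along))
  # Keep only the tagged tokens of `a` that have no counterpart in `b`
  paste(x[!(kx %in% ky)], collapse = ",")
}

ColumnA <- c("A,B,C,A,A,A", "A,B,C")
ColumnB <- c("A,C,A", "C")
result <- mapply(multiset_diff, ColumnA, ColumnB, USE.NAMES = FALSE)
print(result)
```

This still visits rows one at a time through mapply, so for millions of rows a join-based approach is likely to be faster.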

sapply(1:nrow(dt), function(i){
a = dt$ColumnA[i]
b = unlist(strsplit(dt$ColumnB[i], ","))
for (x in b){
a = sub(paste0(x, ",?"), "", a)
}
sub(",$", "", a)
})
#[1] "B,A,A" "A,B"

Not sure whether using string split violates this criterion:
How would one achieve this without first transforming it into a list and then try to subtract the list?
Will delete this post if OP decides that this violates OP's criteria.
Here is an option using data.table's anti-join. It takes about 7 seconds for 2 million rows:
library(data.table)
library(stringi) #for fast string processing
dt <- data.table(ColumnA = c("A,B,C,A,A,A", "A,B,C"), ColumnB = c("A,C,A", "C"))
DT_big <- dt[rep(seq(dt[, .N]), 1e6)]
f <- function(DT, coln) {
res <- DT[, {
s <- stri_split_fixed(get(coln), ',')
.(rn=rep(seq_along(s), lengths(s)), S=unlist(s))
}]
res[, n := rowid(rn, S)] # occurrence counter within each row, not across the whole table
}
system.time({
DTA <- f(DT_big, "ColumnA")
DTB <- f(DT_big, "ColumnB")
ans <- DTA[!DTB, on=.(rn, S, n)][, .(Result=paste(S, collapse=",")), .(rn)][, rn := NULL][]
})
ans
timing:
user system elapsed
7.56 0.33 7.20
output:
Result
1: B,A,A
2: A,B
3: B,A,A
4: A,B
5: B,A,A
---
1999996: A,B
1999997: B,A,A
1999998: A,B
1999999: B,A,A
2000000: A,B

Related

Applying a function to the whole data table by groups

Let's suppose the following data table:
a = runif(40)
b = c(rep(NA,5), runif(5), rep(NA,3),runif(3),NA,runif(3), c(rep(NA,3), runif(7), rep(NA,4), runif(3), NA,NA, runif(1)))
c = rep(1:4,each=10)
DT = data.table(a,b,c)
I want to eliminate the rows with the first NA values in b for every unique value in c (first NAs when c==1, when c==2...), but not the rows with the NAs that come after.
I can do it by using a loop:
for(i in unique(DT$c))
{
first_NA = which(DT$c==i)[1]
last_NA = which(!is.na(DT[,b]) & DT$c==i)[1] - 1
DT = DT[-c(first_NA:last_NA)]
}
But I wonder if there is any simpler way of doing this by using a function for the whole data table using groups (by in data table or groupby in dplyr), without just applying it to columns.
Thank you!
You can filter out the first NA values in b through
DT[, .SD[cumsum( !is.na(b) ) != 0], by = .(c)]
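To see why the cumsum condition works, here is a minimal base R sketch on a made-up data frame (the values and the name df are illustrative, not the data from the question). Within each group the running count of non-NA values stays at 0 only while we are still in the leading run of NAs, so keeping rows where the count is non-zero drops exactly that run and leaves later NAs alone.

```r
# Toy data: group 1 starts with two NAs, group 2 starts with a value
df <- data.frame(
  b = c(NA, NA, 1, NA, 2, NA, 3, 4),
  c = rep(1:2, each = 4)
)

# Running count of non-NA values within each group of c;
# it is 0 only for the leading NAs of a group.
seen <- ave(as.integer(!is.na(df$b)), df$c, FUN = cumsum)

# Keep everything from the first non-NA value of each group onwards
trimmed <- df[seen != 0, ]
print(trimmed)
```

Only the first two rows are dropped here. The NA in row 4 and the NA in row 6 come after a non-NA value in their group and are kept.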
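Alternatively, mark these rows, then keep only the unmarked ones. Note that this flags just the first row of each group when its b is NA, so a longer run of leading NAs is shortened by one row rather than removed entirely.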
# mark values
DT <- DT[, by=c,
flag := is.na(b[1]) # first value of b is NA
& (seq_len(.N)==1) # only for first value
]
# discard marked
DT <- DT[(!flag)]
# remove flag
DT[, flag:=NULL]
or in one line:
DT[, by=c, flag:=is.na(b[1]) & (seq_len(.N)==1)][(!flag)][, flag:=NULL]

Elegant way to match and replace part of string

I have two data tables, dt and dt1:
dt <- data.table(s=c("AA-AA-1", "BB-BB-2", "CC-CC-3"))
s
1 AA-AA-1
2 BB-BB-2
3 CC-CC-3
dt1 <- data.table(x=c(1,2,3), name=c("AA", "BB", "CC"))
x name
1: 1 AA
2: 2 BB
3: 3 CC
I need to replace the part of each string in the s column of dt that follows the last hyphen with the name column of dt1, matching that trailing number against the x column of dt1, so that dt becomes:
s
1: AA-AA-AA
2: BB-BB-BB
3: CC-CC-CC
I know we can do it by splitting s and matching:
split <- sapply(strsplit(as.character(dt$s), split="-"), tail, n=1)
dt1$name[match(split, dt1$x)]
Is there any way to speed it up in an elegant way?
Here is a base R approach. We can create an x column in the first dt data table, using the digit appearing to the right of the final dash (converted to numeric so that it has the same type as dt1$x). Then we can merge the two data tables on x, and finally build the s values you expect by concatenation.
dt$x <- as.numeric(sub(".*-", "", dt$s))
result <- merge(dt, dt1, by="x")
result$s <- paste0(sub("\\d+", "", result$s), result$name)
result$s
[1] "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
I'd take the straightforward approach:
dt1[dt[, .(x = as.integer(sub('.*-', '', s)), str = sub('[^-]+$', '', s))],
on = .(x), .(s = paste0(str, name))]
# s
#1: AA-AA-AA
#2: BB-BB-BB
#3: CC-CC-CC
base R, sprintf + sub
mapply(sprintf, sub("[^-]+$", "%s", dt$s), dt1$name)
# AA-AA-%s BB-BB-%s CC-CC-%s
# "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
I presumed that both data frames are in a matching order (as they are in the example). If not, you need to match them before, for example:
mapply(sprintf, sub("-.?$", "-%s", dt$s), dt1$name[match(gsub("[^0-9]","", dt$s), dt1$x)])
Here is a slightly more general approach.
mapply(function(pat, repl, src){ sub(pat, repl, src) }, pat = dt1$x, repl = dt1$name, src = dt$s )
#[1] "AA-AA-AA" "BB-BB-BB" "CC-CC-CC"
If you say you always want to replace after last - (hyphen), then you can simplify as:
mapply(function(repl, src){ sub("(?<=-)[^-]+$", repl, src, perl = T) }, repl = dt1$name, src = dt$s )
Please note: my solution works only if dt and dt1 are ordered as in the example, i.e. the first rows correspond to each other, and so on. If this is not the case, consider combining @Tim's merge with my solution.
So here is a rock-solid solution using some of Tim's ideas:
dt <- data.table(s=c("AA-AA-1", "BB-BB-2", "CC-CC-3"))
dt1 <- data.table(x=3:1, name=c("CC", "BB", "AA")) # note the order is not right.
dt$x <- sub(".*-", "", dt$s)
dt <- merge.default(dt, dt1, by="x")
dt$endResult <- mapply(function(repl, src){ sub("(?<=-)[^-]+$", repl, src, perl = T) }, repl = dt$name, src = dt$s )
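For comparison, the order-independent lookup can also be sketched with match in base R, without merging. This is a minimal sketch on plain vectors and a data frame. The names s, lookup, key and prefix are chosen here for illustration, and it assumes every trailing number in s appears in lookup$x.

```r
s <- c("AA-AA-1", "BB-BB-2", "CC-CC-3")
# Lookup table deliberately in a different order than s
lookup <- data.frame(x = 3:1, name = c("CC", "BB", "AA"),
                     stringsAsFactors = FALSE)

# Everything after the last hyphen is the key; everything before it is kept
key    <- as.integer(sub(".*-", "", s))
prefix <- sub("[^-]+$", "", s)

# match() finds the row of lookup for each key, regardless of row order
replaced <- paste0(prefix, lookup$name[match(key, lookup$x)])
print(replaced)
```

A key with no entry in the lookup table would give an NA from match, which paste0 turns into the literal text "NA", so unmatched keys may need to be checked separately.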
If they are sorted appropriately as in your example you can use stringr::str_replace:
library(stringr)
dt[,s := str_replace(s,as.character(dt1$x),dt1$name)]
dt
# s
# 1: AA-AA-AA
# 2: BB-BB-BB
# 3: CC-CC-CC

Group individual level data into grouped level list and search for presence %in% that list

In R I have a twofold problem.
First,
I would like to transform my data from this:
d <- data.table(
person_id=1:10,
cat=letters[1:10],
group_id=c(rep(1,5),rep(2,5))
)
Into this:
d_grouped <- data.table(
group_id=1:2
)
d_grouped$Cat_grouped <- list(letters[1:5],letters[6:10] )
i.e. group my data, from person level to group level, but keeping the information on individual characteristics into a column containing list of person level characteristics for each group.
How can I do this aggregation?
Preferably a data.table solution. But it could also be a normal data.frame.
Second,
I would like to search for the presence of the elements of a vector in the list of each group.
Something like (I know this is not correct syntax):
c('a','b') %in% d_grouped$Cat_grouped
which should return another list:
list(c(T,T),c(F,F))
More broadly, I am trying to merge lists (A and B) both containing vectors. The match should be based on the elements of a vector in list A being present on a vector in list B. Is there any merge command based on this SubVector logic?
To accomplish the first transformation,
d[, list(Cat_grouped=paste0(cat, collapse = ',')), group_id]
To accomplish the second, it seems as though your best bet is to leave the data in the original shape? After all
d[, c('a', 'b') %in% cat, group_id]
returns
group_id V1
1: 1 TRUE
2: 1 TRUE
3: 2 FALSE
4: 2 FALSE
All this being said, your "more broadly" appears to be asking for something else, which I fear my answers to the two specific questions do not quite cover. Perhaps you could provide another example?
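One possible reading of the "more broadly" part is: for each vector in list A, find the groups in list B whose vector contains all of its elements. Base R has no dedicated merge for this as far as I know, but under that assumption it can be sketched with nested loops over the two lists. The lists A and B below are made up for illustration, with B named by group_id.

```r
# Hypothetical inputs: A holds the vectors to look up,
# B holds one vector of person-level values per group
A <- list(c("a", "b"), c("f", "j"), c("a", "z"))
B <- list(`1` = letters[1:5], `2` = letters[6:10])

# For each vector in A, collect the names of the groups in B
# that contain every one of its elements
matches <- lapply(A, function(a) {
  hit <- vapply(B, function(b) all(a %in% b), logical(1))
  names(B)[hit]
})
print(matches)
```

The third vector matches no group because "z" is absent everywhere, so its entry is an empty character vector. This compares every pair of vectors, which may be slow when both lists are long.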
Just do it in data.table, returning a list for each by= group:
d[, .(cat_grouped=.(cat)), by=group_id]
# group_id cat_grouped
#1: 1 a,b,c,d,e
#2: 2 f,g,h,i,j
I tend to agree with @HarlandMason's answer, however, that the analysis you are doing does not require this intermediate data.table.
Base R solution using aggregate
d2 = aggregate(list(cat = d$cat), list(group = d$group_id), function(x)
as.character(x), simplify = FALSE)
d2
# group cat
#1 1 a, b, c, d, e
#2 2 f, g, h, i, j
lapply(d2$cat, function(x) c("a","b") %in% x)
#$`1`
#[1] TRUE TRUE
#$`2`
#[1] FALSE FALSE
Also consider
mylist = split(d$cat, d$group_id)
We can also use dplyr
library(dplyr)
d %>%
group_by(group_id) %>%
summarise(cat = list(cat))

How to compare two columns in R data frame and return 0 or 1 in the third column based on the comparison?

I have a data frame with two columns (both are dates) and a million rows. I have to compare the two dates and return a value in a third column, i.e. if the date in column A is greater than the date in column B, return 1 in column C.
Thanks in advance :)
In base:
DF$C <- as.numeric(DF$A > DF$B)
In dplyr:
DF %>%
mutate(C = as.numeric(A > B))
library(data.table)
dt <- as.data.table(dt)
dt$A <- as.Date(dt$A)
dt$B <- as.Date(dt$B)
Here are two ways you can try:
dt[, C := ifelse(A > B, 1, 0)]
or
dt[, C := 0][A > B, C := 1]
In the second way, you can instead use dt[, C := 1][A <= B, C := 0], depending on which condition matches fewer observations.
Maybe you need to provide a little reproducible example.
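A small reproducible example of the comparison might look like the sketch below. The dates are invented, and it assumes they arrive as ISO-formatted strings ("YYYY-MM-DD"), which is why they are converted with as.Date before comparing.

```r
# Made-up data: dates stored as character strings
DF <- data.frame(
  A = c("2020-01-02", "2020-01-01", "2020-03-01"),
  B = c("2020-01-01", "2020-01-02", "2020-03-01"),
  stringsAsFactors = FALSE
)

# Convert to Date so the comparison is chronological
DF$A <- as.Date(DF$A)
DF$B <- as.Date(DF$B)

# TRUE/FALSE becomes 1/0; equal dates give 0 because A > B is strict
DF$C <- as.integer(DF$A > DF$B)
print(DF)
```

The comparison is vectorised over the whole column, so no loop or ifelse is needed. If either date is NA, C is NA for that row as well.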

How to merge two columns in R with a specific symbol?

I have a table read in R as follows:
column1 column2
A B
What is the command to merge the two columns together as follows?
Column 3
A_B
I'm a bit unsure what you mean by "merge", but is this what you mean?
> DF = data.frame(A = LETTERS[1:10], B = LETTERS[11:20])
> DF$C = paste(DF$A, DF$B, sep="_")
> head(DF)
A B C
1 A K A_K
2 B L B_L
3 C M C_M
4 D N D_N
Or equivalently, as @daroczig points out:
within(DF, C <- paste(A, B, sep='_'))
My personal favourite makes use of unite from tidyr:
set.seed(1)
df <- data.frame(colA = sample(LETTERS, 10),
colB = sample(LETTERS, 10))
# packs: pipe + unite
require(magrittr); require(tidyr)
# Unite
df %<>%
unite(ColAandB, colA, colB, remove = FALSE)
Results
> head(df, 3)
ColAandB colA colB
1 G_F G F
2 J_E J E
3 N_Q N Q
Side notes
Personally, I find the remove = TRUE / FALSE functionality of unite very useful. In addition, tidyr fits the dplyr workflow very well and plays well with separate in case you change your mind about the columns being merged. Along the same lines, if NAs are the problem, introducing na.omit to your workflow would enable you to conveniently drop the undesirable rows before creating the desired column.
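To illustrate the NA point, here is a minimal base R sketch with made-up data, using paste rather than unite. paste turns a missing value into the literal text "NA", so dropping incomplete rows with na.omit first avoids results such as "NA_L".

```r
# Made-up data with one missing value in each column
DF <- data.frame(A = c("A", NA, "C"), B = c("K", "L", NA),
                 stringsAsFactors = FALSE)

# Without cleaning, NA is pasted as the string "NA"
raw <- paste(DF$A, DF$B, sep = "_")
print(raw)

# Drop rows containing any NA, then build the combined column
clean <- na.omit(DF)
clean$C <- paste(clean$A, clean$B, sep = "_")
print(clean)
```

Only the first row survives na.omit here, so whether dropping rows is acceptable depends on the data.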
