Finding and matching reversed strings efficiently in R

I have a large number of strings (~280,000) that all have the following format "ABC12D/XYZ34A". In my data, each of those strings has a duplicate entry that is identical but in reverse, e.g. "XYZ34A/ABC12D" for the example above. So, my data looks something like this:
1 "ABC12D/XYZ34A"
2 "TUR44F/SWP29R"
3 "PLL93S/WQQ22F"
4 "YNV77C/AAZ05S"
5 "SWP29R/TUR44F"
6 "AAZ05S/YNV77C"
7 "CLK86G/ERF74Q"
8 "XYZ34A/ABC12D"
9 "ERF74Q/CLK86G"
10 "WQQ22F/PLL93S"
Row 1 matches row 8, row 2 matches row 5, etc.
My aims are: 1) for a given string, find where its reversed entry is and keep this index and then 2) replace the reverse entry with the non-reverse entry:
1 "ABC12D/XYZ34A" 8
2 "TUR44F/SWP29R" 5
3 "PLL93S/WQQ22F" 10
4 "YNV77C/AAZ05S" 6
5 "TUR44F/SWP29R" 0
6 "YNV77C/AAZ05S" 0
7 "CLK86G/ERF74Q" 9
8 "ABC12D/XYZ34A" 0
9 "CLK86G/ERF74Q" 0
10 "PLL93S/WQQ22F" 0
Currently, I do this in the following way using a loop:
df <- data.frame(c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F",
"YNV77C/AAZ05S", "SWP29R/TUR44F", "AAZ05S/YNV77C", "CLK86G/ERF74Q",
"XYZ34A/ABC12D", "ERF74Q/CLK86G", "WQQ22F/PLL93S"), stringsAsFactors =
FALSE)
colnames(df) <- "entries"
df
# Reverse function
reverse.entry <- function(string) {
string.reversed <- paste(rev(strsplit(string, "/")[[1]]), collapse = '/')
string.reversed
}
duplicate.flag <- list()
duplicate.idx <- list()
# Find and replace reversed entries
for (i in 1:dim(df)[[1]]) {
# current entry
string = df[i,]
# reverse the current entry
string.reversed <- reverse.entry(string)
# if any other entry matches the reversed string get match index
if (grepl(string.reversed, df)) {
print(sprintf("%d found a reversal", i))
idx <- which(df == string.reversed)
duplicate.flag[i] <- 1;
duplicate.idx[i] <- idx;
# replace reversed strings with original strings
df[idx,] <- string
} else {
duplicate.flag[i] <- 0;
duplicate.idx[i] <- 0;
}
}
data.frame(df, unlist(duplicate.idx), unlist(duplicate.flag))
However, this is quite slow and is taking several hours. Is there a better way of programming this? I'm fairly new to R and programming so am not terribly good at vectorization etc. Since each entry has one reverse entry, I could also just have the loop for 1:dim(df)[[1]] / 2. Would that already save a lot of time?
Many thanks!

You could do something like this...
df$no <- seq_along(df$entries) #number the entries
df$rev <- gsub("(.+)/(.+)","\\2/\\1",df$entries) #calculate reverse entries
df$whererev <- match(df$rev, df$entries) #identify where reversed entries occur
df$whererev[df$whererev>df$no] <- NA #remove the first of each duplicated pair
df$entries[!is.na(df$whererev)] <- df$rev[!is.na(df$whererev)] #replace duplicates
df
no entries rev whererev
1 1 ABC12D/XYZ34A XYZ34A/ABC12D NA
2 2 TUR44F/SWP29R SWP29R/TUR44F NA
3 3 PLL93S/WQQ22F WQQ22F/PLL93S NA
4 4 YNV77C/AAZ05S AAZ05S/YNV77C NA
5 5 TUR44F/SWP29R TUR44F/SWP29R 2
6 6 YNV77C/AAZ05S YNV77C/AAZ05S 4
7 7 CLK86G/ERF74Q ERF74Q/CLK86G NA
8 8 ABC12D/XYZ34A ABC12D/XYZ34A 1
9 9 CLK86G/ERF74Q CLK86G/ERF74Q 7
10 10 PLL93S/WQQ22F PLL93S/WQQ22F 3
Note that I have marked the second duplicate rather than the first, as this makes it easier (and probably substantially quicker) to replace the second one, rather than having to look it up from the first one. (Line 4 would have < rather than > if you wanted to recreate your marking of the first of each duplicated pair).
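For reference, here is a minimal base R sketch that combines the steps above and reproduces the layout requested in the question (index kept on the first entry of each pair, 0 elsewhere). The names partner, is_first and is_second are illustrative only, and the sketch assumes every string contains exactly one slash and occurs at most once.

```r
# Sample data from the question
entries <- c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F", "YNV77C/AAZ05S",
             "SWP29R/TUR44F", "AAZ05S/YNV77C", "CLK86G/ERF74Q", "XYZ34A/ABC12D",
             "ERF74Q/CLK86G", "WQQ22F/PLL93S")

# Swap the two halves around the slash
reversed <- sub("^(.+)/(.+)$", "\\2/\\1", entries)

# Position of each entry's reversed partner (NA if there is none)
partner <- match(reversed, entries)

# Keep the index only on the first member of each pair, 0 otherwise
idx <- seq_along(entries)
is_first <- !is.na(partner) & partner > idx
duplicate_idx <- ifelse(is_first, partner, 0L)

# Overwrite the second member of each pair with the first member's string
is_second <- !is.na(partner) & partner < idx
entries[is_second] <- reversed[is_second]

result <- data.frame(entries, duplicate_idx, stringsAsFactors = FALSE)
result
```

Because match() does a single hashed lookup over the whole vector, this avoids the per-row scan of the original loop.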

Here's my solution:
require(data.table)
get_index <- function(string,values,current_index){
string_present <- match(string,values)
string_present[string_present<current_index] <- 0
return(string_present)
}
mydata <- c("ABC12D/XYZ34A","TUR44F/SWP29R","PLL93S/WQQ22F","YNV77C/AAZ05S","SWP29R/TUR44F","AAZ05S/YNV77C","CLK86G/ERF74Q","XYZ34A/ABC12D","ERF74Q/CLK86G","WQQ22F/PLL93S")
mydf <- data.table(mystring = mydata,stringsAsFactors = FALSE)
mydf[,revmystring:=gsub("(.+)\\/(.+)","\\2\\/\\1",mystring)]
mydf[,duplicate_index:=get_index(revmystring,mystring,.I)]
The solution it gives is:
> mydf
mystring revmystring duplicate_index
1: ABC12D/XYZ34A XYZ34A/ABC12D 8
2: TUR44F/SWP29R SWP29R/TUR44F 5
3: PLL93S/WQQ22F WQQ22F/PLL93S 10
4: YNV77C/AAZ05S AAZ05S/YNV77C 6
5: SWP29R/TUR44F TUR44F/SWP29R 0
6: AAZ05S/YNV77C YNV77C/AAZ05S 0
7: CLK86G/ERF74Q ERF74Q/CLK86G 9
8: XYZ34A/ABC12D ABC12D/XYZ34A 0
9: ERF74Q/CLK86G CLK86G/ERF74Q 0
10: WQQ22F/PLL93S PLL93S/WQQ22F 0
You can implement this without data.table as well.

Here is a proposition using outer and gsub:
## Create a matrix of correspondence o between elements and reverses
o = outer(df[,1],df[,1],function(x,y) gsub("(.*)/(.*)","\\2/\\1",y)==x)
o[upper.tri(o)] = F
## Identify the indices of correspondence
df$ind = unlist(apply(o,2,function(x) which(x==T)[1]))
df$ind[is.na(df$ind)] = 0
## Replace reverses by originals
df[,1][df$ind[df$ind!=0]] = df[,1][df$ind!=0]
This returns:
V1 ind
1 ABC12D/XYZ34A 8
2 TUR44F/SWP29R 5
3 PLL93S/WQQ22F 10
4 YNV77C/AAZ05S 6
5 TUR44F/SWP29R 0
6 YNV77C/AAZ05S 0
7 CLK86G/ERF74Q 9
8 ABC12D/XYZ34A 0
9 CLK86G/ERF74Q 0
10 PLL93S/WQQ22F 0

Related

Comparing items in a list to a dataset in R

I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end with a dataset that looks something like this
Obs Var List
1 1
2 2 1
3 3 1
4 4 1
5 5
6 6
7 7 2
8 8 2
9 9
10 10
How do I go about doing this? Thank you!
Here is one way to do it using boolean sum and %in%. If several match, then the last one is taken here:
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7,8))
present <- sapply(1:length(list_all), function(n) dat$Obs %in% list_all[[n]]*n)
dat$List <- apply(present, 1, FUN = max)
dat$List[dat$List == 0] <- NA
dat
> dat
Obs List
1 1 NA
2 2 1
3 3 1
4 4 1
5 5 NA
6 6 NA
7 7 2
8 8 2
9 9 NA
10 10 NA
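The same lookup can also be sketched with a single match() call against a flattened table of list items. This is an alternative to the approach above, not a restatement of it: the lookup data frame and its column names are made up for illustration, and when an observation appears in several lists this version keeps the first list rather than the last.

```r
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7, 8))

# Flatten the lists into one lookup table: item value and the list it came from
lookup <- data.frame(
  value   = unlist(list_all),
  list_id = rep(seq_along(list_all), lengths(list_all))
)

# match() returns the position of the first hit, or NA when there is none
dat$List <- lookup$list_id[match(dat$Obs, lookup$value)]
dat
```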

Need help concatenating column names

I am generating 5 different predictions and adding those predictions to an existing data frame. My code is:
for (j in i) {
…
actual.predicted <- data.frame(test_data, predicted)
}
I am trying to concatenate words together to create new column names, in the loop. Specifically, I have a column named “predicted” and I am generating predictions in each iteration of the loop. So, in the first iteration, I want the new column name to be “predicted.1” and for the second iteration, the new column name should be “predicted.2” and so on.
Any thoughts would be greatly appreciated.
You may not even need to use a loop here, but assuming you do, one pattern which might work well here would be to use a list:
results <- list()
for (j in i) {
# do something involving j
name <- paste0("predicted.", j)
results[[name]] <- data.frame(test_data, predicted)
}
One option is to set the names after assigning new columns:
actual.predicted <- data.frame(orig_col = sample(10))
for (j in 1:5){
new_col = sample(10)
actual.predicted <- cbind(actual.predicted, new_col)
names(actual.predicted)[length(actual.predicted)] <- paste0('predicted.',j)
}
actual.predicted
# orig_col predicted.1 predicted.2 predicted.3 predicted.4 predicted.5
# 1 1 4 4 9 1 5
# 2 10 2 3 7 5 9
# 3 8 6 5 4 2 3
# 4 5 9 9 10 7 7
# 5 2 1 10 8 3 10
# 6 9 7 6 6 8 6
# 7 7 8 7 2 4 2
# 8 3 3 1 1 6 8
# 9 6 10 2 3 9 4
# 10 4 5 8 5 10 1
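If the predictions can be collected in a list first, the names can be built in one step with paste0() and no renaming is needed inside the loop. The following is only a sketch: test_data, n_models and the x * j "predictions" are hypothetical stand-ins for the real data and model output.

```r
# Hypothetical stand-in for the real test data
test_data <- data.frame(x = 1:10)
n_models <- 5

# One prediction vector per iteration (here simply x * j for illustration)
preds <- lapply(seq_len(n_models), function(j) test_data$x * j)

# Build all column names at once: predicted.1, predicted.2, ...
names(preds) <- paste0("predicted.", seq_len(n_models))

# Bind the named predictions onto the original data
actual.predicted <- cbind(test_data, as.data.frame(preds))
names(actual.predicted)
```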

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both columns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. Then the same goes for grp2. All 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
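If igraph is not available, the connected components can also be approximated in base R by repeatedly propagating the smallest label across rows that share a grp1 or grp2 value. This is a different technique from the igraph code above (label propagation rather than a graph library), and combine_groups is a hypothetical helper name used only for this sketch.

```r
# Hypothetical helper: label rows by connected component without igraph
combine_groups <- function(g1, g2) {
  # Start with one label per distinct g1 value
  label <- match(g1, unique(g1))
  repeat {
    old <- label
    # Rows sharing a g2 value adopt the smallest label among them
    label <- ave(label, g2, FUN = min)
    # Rows sharing a g1 value adopt the smallest label among them
    label <- ave(label, g1, FUN = min)
    # Stop once a full pass changes nothing
    if (all(label == old)) break
  }
  # Renumber the surviving labels as 1, 2, ... in order of appearance
  match(label, unique(label))
}

df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
df$combinedGrp <- combine_groups(df$grp1, df$grp2)
df
```

Each pass only touches whole vectors, so the loop runs once per "hop" needed to connect a component rather than once per row.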
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to @Frank's graph answer but uses an adjacency matrix rather than edges to define the graph. An advantage of this approach is that the same code handles more than two grouping columns (so long as you write the function that determines links flexibly). A disadvantage is that you need to make all pairwise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, @Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are:
1. Compare rows based on groups and define these rows as linked (i.e., create a graph).
2. Determine connected components of the graph defined by the links in 1.
You could do 2 a few ways. Below I show a brute force way where you 2a) collapse links, till reaching a stable link structure using matrix multiplication and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. Construct an adjacency matrix (matrix of pairwise links) between rows
(i.e., if they are in the same group, the matrix entry is 1; otherwise it's 0). First, make a helper function that determines whether two rows are linked:
linked_rows <- function(data){
## helper function
## returns a _function_ to compare two rows of data
## based on group membership.
## Use Vectorize so it works even on vectors of indices
Vectorize(function(i, j) {
## numeric: 1= i and j have overlapping group membership
common <- vapply(names(data), function(name)
data[i, name] == data[j, name],
FUN.VALUE=FALSE)
as.numeric(any(common))
})
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
A <- A %*% A
A[A > 0] <- 1
A
}
repeat this till the links are stable
oldA <- 0
i <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way is to first define a distance object, then use hclust and cutree. Since linked rows (A[i,j] == 1) should end up in the same group, we define linked as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:
lump <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
oldA <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
}
This works for the original df and also for the structure in @rawr's answer:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with @Frank's answer more clear:
lump2 <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
df$combinedGrp <- cluster_A$membership
df
}
Hope this solution helps you a bit:
Assumption: df is ordered on the basis of grp1.
## split dataset using values of grp1
split_df <- split.default(df$grp2,df$grp1)
parent <- vector('integer',length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1,length(split_df)-1)){
for (j in seq(i+1,length(split_df))){
inter <- intersect(split_df[[i]],split_df[[j]])
if (length(inter) > 0){
parent[j] <- i
}
}
}
ans <- vector('list',length(split_df))
index <- which(parent == 0)
## index contains indices of elements that have no element common
for (i in seq_along(index)){
ans[[index[i]]] <- rep(i,length(split_df[[index[i]]]))
}
rest_index <- seq(1,length(split_df))[-index]
for (i in rest_index){
val <- ans[[parent[i]]][1]
ans[[i]] <- rep(val,length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
library(dplyr) # needed for group_by(), summarize() and %>% below
# Use example data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- nc[-sample(1:nrow(nc),nrow(nc)*.75),] #drop some polygons
# Find intersections
b <- st_intersects(nc, sparse = F)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for(i in 1:nrow(nc)){
for(j in 1:length(gr)){
if(i %in% gr[[j]]){
nc[i,'group'] <- j
}
}
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])

For Loop Function in R

I have been struggling to figure out why I am not returning the correct values to my data frame from my function. I want to loop through a vector of my data frame and create a new column by a calculation within the vector's elements. Here's what I have:
# x will be the data frame's vector
y <- function(x){
new <- c()
for (i in x){
new <- c(new, x[i] - x[i+1])
}
return (new)
}
So here I want to create a new vector that returns the next element subtracted from current element. Now, when I apply it to my data frame
df$new <- lapply(df$I, y)
I get all NAs. I know I'm missing something completely obvious...
Also, how would I execute the function that resets itself if df$ID changes so I am not subtracting elements from two different df$IDs? For example, my data frame will have
ID I Order new
1001 5 1 1
1001 6 2 -2
1001 4 3 -2
1001 2 4 NA
1005 2 1 6
1005 8 2 0
1005 8 3 -2
1005 6 4 NA
Thanks!
Avoid the loop and use diff. Everything is vectorized here so it's easy.
df$new <- c(diff(df$I), NA)
But I don't understand your example result. Why are some 0 values changed to NA and some are not? And shouldn't 8-2 be 6 and not -6? I think that needs to be clarified.
If the 0 values need to be changed to NA, just do the following after the above code.
df$new[df$new == 0] <- NA
A one-liner of the complete process, that returns the new data frame, can be
within(df, { new <- c(diff(I), NA); new[new == 0] <- NA })
Update : With respect to your comments below, my updated answer follows.
> M <- do.call(rbind, Map(function(x) { x$z <- c(diff(x$I), NA); x },
split(dat, dat$ID)))
> rownames(M) <- NULL
> M
ID I Order z
1 1001 5 1 1
2 1001 6 2 -2
3 1001 4 3 -2
4 1001 2 4 NA
5 1005 2 1 6
6 1005 8 2 0
7 1005 8 3 -2
8 1005 6 4 NA
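A base R alternative to the split/rbind step is ave(), which applies a function within each group and returns the results in the original row order. The sketch below rebuilds the question's data by hand (the data frame is reconstructed from the example table, so treat it as an assumption about the real data).

```r
# Data reconstructed from the example table in the question
df <- data.frame(ID    = rep(c(1001, 1005), each = 4),
                 I     = c(5, 6, 4, 2, 2, 8, 8, 6),
                 Order = rep(1:4, 2))

# Within each ID: next value minus current value, NA for the last row
df$new <- ave(df$I, df$ID, FUN = function(v) c(diff(v), NA))
df
```

Because the padding with NA happens inside each group, no difference is ever taken across two different IDs.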
The dplyr library makes it very easy to do things separately for each level of a grouping variable, in your case ID. We can use diff as @Richard Scriven recommends, and use dplyr::mutate to add a new column.
> library(dplyr)
> df %>% group_by(ID) %>% mutate(new2 = c(diff(I), NA))
Source: local data frame [8 x 5]
Groups: ID
ID I Order new new2
1 1001 5 1 1 1
2 1001 6 2 -2 -2
3 1001 4 3 -2 -2
4 1001 2 4 NA NA
5 1005 2 1 6 6
6 1005 8 2 0 0
7 1005 8 3 -2 -2
8 1005 6 4 NA NA
Rather than a loop, you would be better off using a vector version of the math. The exact indices will depend on what you want to do with the last value... (Note this line is not placed into your for loop, but just gives the result.)
df$new = c(df$I[-1],NA) - df$I
Here you will be subtracting the original df$I from a shifted version that omits the first value [-1] and appends a NA at the end.
EDIT per comments: If you don't want to subtract across df$ID, you can blank out that subset of cells after subtraction:
df$new[df$ID != c(df$ID[-1],NA)] = NA

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument cuts the negated value of the argument from the tail, try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
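Taken apart, the embed() one-liner looks roughly like the sketch below. The intermediate names padded, m and lags are illustrative; the last line shows one way to sum each group, which the question also asked about.

```r
a <- 1:10

# Pad with two NAs so the first rows have something to look back at
padded <- c(NA, NA, a)

# Row t of m holds padded[t + 2], padded[t + 1], padded[t],
# i.e. the current value, the value 1 row back, and the value 2 rows back
m <- embed(padded, 3)

# Keep "2 back" and "1 back", in that order
lags <- m[, c(3, 2)]

# Row-wise sum of each group (NA wherever the group is incomplete)
group_sum <- rowSums(lags)

cbind(a, lags, group_sum)
```

Passing na.rm = TRUE to rowSums() would instead treat the missing leading values as zero.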
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Resources