Finding "similar" rows performing a conditional join with sqldf - r

Say I got a data.table (can also be data.frame, doesn't matter to me) which has numeric columns a, b, c, d and e.
Each row of the table represents an article and a-e are numeric characteristics of the articles.
What I want to find out is which articles are similar to each other, based on columns a, b and c.
I define "similar" by allowing a, b and c to vary +/- 1 at most.
That is, article x is similar to article y if neither a, b nor c differs by more than 1. Their values for d and e don't matter and may differ significantly.
I've already tried a couple of approaches but didn't get the desired result. What I want to achieve is to get a result table which contains only those rows that are similar to at least one other row. Plus, duplicates must be excluded.
Particularly, I'm wondering if this is possible using the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I don't get it together properly. Any ideas (not necessarily using sqldf)?

Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Add another condition and a.id < b.id if each row should not be paired with itself and if we want to exclude the reverse of each pair or add and not a.id = b.id to just exclude self pairs.
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
1 2 3 4 5 6 7 8 9 10 11
1 1 0 0 1 1 0 0 0 0 0 0
2 0 1 0 1 0 0 0 0 0 1 0
3 0 0 1 0 0 1 0 0 1 0 0
4 1 1 0 1 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 1 0 0
6 0 0 1 0 0 1 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 1 1
8 0 0 0 0 0 0 0 1 0 0 1
9 0 0 1 0 1 0 0 0 1 0 0
10 0 1 0 0 0 0 1 0 0 1 0
11 0 0 0 0 0 0 1 1 0 0 1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired or diag(m) <- 0 to just exclude self pairs.
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
subset(as.data.frame.table(m), c(Var1) < c(Var2) & Freq == 1)[1:2]
giving:
Var1 Var2
34 1 4
35 2 4
45 1 5
58 3 6
91 3 9
93 5 9
101 2 10
106 7 10
117 7 11
118 8 11
Here is a network graph of the above. Note that answer continues after the graph:
# network graph
library(igraph)
g <- graph.adjacency(m)
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()

I am quite new to R so don't expect to much.
What if you create from your values (which are basically vectors) a matrix with the distance from the two values. So you can find those combinations which have a difference of less than 1 from each other. Via this way you can find the matching (a)-pairs. Repeat this with (b) and (c) and find those which are included in all pairs.
Alternatively this can probably be done as a cube as well.
Just as a thought hint.

Related

Compute combination of a pair variables for a given operation in R

From a given dataframe:
# Create dataframe with 4 variables and 10 obs
set.seed(1)
df<-data.frame(replicate(4,sample(0:1,10,rep=TRUE)))
I would like to compute a substract operation between in all columns combinations by pairs, but only keeping one substact, i.e column A- column B but not column B-column A and so on.
What I got is very manual, and this tend to be not so easy when there are lots of variables.
# Result
df_result <- as.data.frame(list(df$X1-df$X2,
df$X1-df$X3,
df$X1-df$X4,
df$X2-df$X3,
df$X2-df$X4,
df$X3-df$X4))
Also the colname of the feature name should describe the operation i.e.(x1_x2) being x1-x2.
You can use combn:
COMBI = combn(colnames(df),2)
res = data.frame(apply(COMBI,2,function(i)df[,i[1]]-df[,i[2]]))
colnames(res) = apply(COMBI,2,paste0,collapse="minus")
head(res)
X1minusX2 X1minusX3 X1minusX4 X2minusX3 X2minusX4 X3minusX4
1 0 0 -1 0 -1 -1
2 1 1 0 0 -1 -1
3 0 0 0 0 0 0
4 0 0 -1 0 -1 -1
5 1 1 1 0 0 0
6 -1 0 0 1 1 0

Create a new conditional column based on binary values in other columns? (R)

My data looks like this:
A B C
1 0 0
0 1 0
0 1 0
0 0 1
This is what I’m going for:
A B C New_Column
1 0 0 A
0 1 0 B
0 1 0 B
0 0 1 C
So I'm creating a new column that tells me which of the three variables (A, B, or C) is present. Only one of the three columns will contain a 1 per row. What is the best way to go about this?
We can use max.col
df1$New_Column <- names(df1)[max.col(df1, "first")]

How to convert the result of xtabs() into dataframe in R? [duplicate]

This question already has answers here:
How to convert a table to a data frame
(5 answers)
Closed 4 years ago.
I have data like dataframe df_a, and want to have it converted to the format as in dataframe df_b.
xtabs() gives similar result, but I did not find a way to access elements as in the example code below. Accessing through xa[1,1] gives no advantage since there is a weak correlation between indexing by numbers ("1") and names ("A"). As you can see there is a sort difference in the xtabs() result, so xa[2,2]=2 and not 0 as on the df_b listing.
> df_a
ItemName Feature Amount
1 First A 2
2 First B 3
3 First A 4
4 Second C 3
5 Second C 2
6 Third D 1
7 Fourth B 2
8 Fourth D 3
9 Fourth D 2
> df_b
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
> df_b$A
[1] 6 0 0 0
> xa<-xtabs(df_a$Amount~df_a$ItemName+df_a$Feature)
> xa
df_a$Feature
df_a$ItemName A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
> xa$A
Error in xa$A : $ operator is invalid for atomic vectors
There is a way of iterative conversion with for() loops, but totally inefficient in my case because my data has millions of records.
For the purpose of further processing my required output format is dataframe.
If anyone solved similar problem please share.
You can just use as.data.frame.matrix(xa)
# output
A B C D
First 6 3 0 0
Fourth 0 2 0 5
Second 0 0 5 0
Third 0 0 0 1
## or
df_b <- as.data.frame.matrix(xa)[unique(df_a$ItemName), ]
data.frame(ItemName = row.names(df_b), df_b, row.names = NULL)
# output
ItemName A B C D
1 First 6 3 0 0
2 Second 0 0 5 0
3 Third 0 0 0 1
4 Fourth 0 2 0 5
Without using xtabs you can do something like this:
df %>%
dplyr::group_by(ItemName, Feature) %>%
dplyr::summarise(Sum=sum(Amount, na.rm = T)) %>%
tidyr::spread(Feature, Sum, fill=0) %>%
as.data.frame()
This will transform as you require and it stays as a data.frame
Or, you can just as.data.frame(your_xtabs_result) and that should work too

How to refer to previous cell in a data-frame column (lagged cell), in R

I’m working in R and am trying to find a way to refer to the previous cell within a vector when that vector belongs to a data frame. By previous cell, I’m essentially hoping for a “lag” command of some sort so that I can compare one cell to the cell previous. As an example, I have these data:
A <- c(1,0,0,0,1,0,0)
B <- c(1,1,1,1,1,0,0)
AB_df <- cbind (A,B)
What I want is for a given cell in a given row, if that cell’s value is less than the previous cell’s value for the same column vector, to return a value of 1 and if not to return a value of 0. For this example, the new columns would be called “A-flag” and “B-flag” below.
A B A-flag B-flag
1 1 0 0
0 1 1 0
0 1 0 0
0 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
Any suggestions for syntax that can do this? Ideally, to just create a new column variable into an existing data-frame.
Here is one solution using dplyr package and it's lag method:
library(dplyr)
AB_df <- data.frame(A = A, B = B)
AB_df %>% mutate(A.flag = ifelse(A < lag(A, default = 0), 1, 0),
B.flag = ifelse(B < lag(B, default = 0), 1, 0))
A B A.flag B.flag
1 1 1 0 0
2 0 1 1 0
3 0 1 0 0
4 0 1 0 0
5 1 1 0 0
6 0 0 1 1
7 0 0 0 0

Create virtual categorical column based on existing values

Reference:
Transpose and create categorical values in R
Follow-up to this question. While both model.matrix and data.table work very well with values already in it, how can we use them to simulate a column?
Meaning, from data in the same data frame,
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
If I were to simulate the case statement with OR condition from SQL in R, how do I go about it? In SQL I would do:
case when ( sex = 'F' OR sex = 'M') AND CONTROL IS NOT NULL THEN 1 ELSE 0 AS F_M_CONTROL
case when (sex = 'F' OR sex = 'M') AND COND1 IS NOT NULL THEN 1 ELSE 0 AS F_M_COND1
bringing the output to:
subject weight control_F_M control_M condtrol_F cond1_F_M cond1_F cond1_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
Any idea how I can generate the "Control_F_M" and Cond1_F_M columns in R?
Thanks in advance,
Bee
Edit:
To generate the afore mentioned output, i'm using the data table & dcast as suggested before.
I can use If-Else if I knew all the values in the column: test. I apologize for not clarifying this earlier. The challenge ofcourse is that the column is dynamic and so I'm hoping to generate that many columns dynamically as an extension to the below or using a similar approach.
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))

Resources