Calculate Euclidian distances between elements of two data sets - r

I have two data sets of the same size:
>df1
c d e
a 2 3 4
b 5 1 3
>df2
h i j
f 1 1 2
g 0 4 3
I need to calculate Euclidian distances between the same elements of these data sets to get:
c d e
a 1 2 2
b 5 3 0
I have tried using dist(rbind(df1, df2)), but the result gave only one entry.
I have to perform this operation with numerous data sets, that's why your help will be really appreciated.

The following will work if the data frames are all numeric and have the same column and row numbers.
df3 <- abs(df1 - df2)
df3
# c d e
# a 1 2 2
# b 5 3 0
DATA
df1 <- read.table(text = " c d e
a 2 3 4
b 5 4 3",
header = TRUE, stringsAsFactors = FALSE, row.names = 1)
df2 <- read.table(text = " h i j
f 1 1 2
g 0 1 3",
header = TRUE, stringsAsFactors = FALSE, row.names = 1)

Given your update the solution would be to do absolute value (abs) of the difference:
abs(df1 - df2)
And you could make a function if you want to repeat the process a lot:
myfunc1 <- function(x1,x2){
abs(x1 - x2)
}
myfunc1(df1, df2)
The output looks as intended:
[,1] [,2] [,3]
[1,] 1 2 2
[2,] 5 3 0

Related

Find values in data frame 2 which is found in data frame 1, within a certain range

I want to find which values in df2 which is also present in df1, within a certain range. One value is considering both a and b in the data frames (a & b can't split up). For examples, can I find 9,1 (df1[1,1]) in df2? It doesn't have to be on the same position. Also, we can allow a diff of for example 1 for "a" and 1 for "b". For example, I want to find all values 9+-1,1+-1 in df2. "a" & "b" always go together, each row stick together. Does anyone have a suggestion of how to code this? Many many thanks!
set.seed(1)
a <- sample(10,5)
set.seed(1)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df1 <- data.frame(feature,a,b)
df1
> df1
feature a b
A 9 1
B 4 4
C 7 1
D 1 2
E 2 5
set.seed(2)
a <- sample(10,5)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df2 <- data.frame(feature,a,b)
df2
df2
feature a b
A 5 1
B 6 4
C 9 5
D 1 1
E 10 2
Not correct but Im imaging this can be done for a for loop somehow!
for(i in df1[,1]) {
for(j in df1[,2]){
s<- c(s,(df1[i,1] & df1[j,2]== df2[,1] & df2[,2]))# how to add certain allowed diff levels?
}
}
s
Output wanted:
feature_df1 <- LETTERS[1:5]
match <- c(1,0,0,1,0)
feature_df2 <- c("E","","","D", "")
df <- data.frame(feature_df1, match, feature_df2)
df
feature_df1 match feature_df2
A 1 E
B 0
C 0
D 1 D
E 0
I loooove data.table, which is (imo) the weapon of choice for these kind of problems..
library( data.table )
#make df1 and df2 a data.table
setDT(df1, key = "feature"); setDT(df2)
#now perform a join operation on each row of df1,
# creating an on-the-fly subset of df2
df1[ df1, c( "match", "feature_df2") := {
val = df2[ a %between% c( i.a - 1, i.a + 1) & b %between% c(i.b - 1, i.b + 1 ), ]
unique_val = sort( unique( val$feature ) )
num_val = length( unique_val )
list( num_val, paste0( unique_val, collapse = ";" ) )
}, by = .EACHI ][]
# feature a b match feature_df2
# 1: A 9 1 1 E
# 2: B 4 4 0
# 3: C 7 1 0
# 4: D 1 2 1 D
# 5: E 2 5 0
One way to go about this in Base R would be to split the data.frames() into a list of rows then calculate the absolute difference of row vectors to then evaluate how large the absolute difference is and if said difference is larger than a given value.
Code
# Find the absolute difference of all row vectors
listdif <- lapply(l1, function(x){
lapply(l2, function(y){
abs(x - y)
})
})
# Then flatten the list to a list of data.frames
listdifflat <- lapply(listdif, function(x){
do.call(rbind, x)
})
# Finally see if a pair of numbers is within our threshhold or not
m1 <- 2
m2 <- 3
listfin <- Map(function(x){
x[1] > m1 | x[2] > m2
},
listdifflat)
head(listfin, 1)
[[1]]
V1
[1,] TRUE
[2,] FALSE
[3,] TRUE
[4,] TRUE
[5,] TRUE
[6,] TRUE
[7,] TRUE
[8,] TRUE
[9,] TRUE
[10,] TRUE
Data
df1 <- read.table(text = "
4 1
7 5
1 5
2 10
13 6
19 10
11 7
17 9
14 5
3 5")
df2 <- read.table(text = "
15 1
6 3
19 6
8 2
1 3
13 7
16 8
12 7
9 1
2 6")
# convert df to list of row vectors
l1<- lapply(1:nrow(df1), function(x){
df1[x, ]
})
l2 <- lapply(1:nrow(df2), function(x){
df2[x, ]
})

R merge() rbinds instead of merging

I ran across a behaviour of merge() in R that I can't understand. It seems that it either merges or rbinds data frames depending on whether a column has one or more unique values in it.
a1 <- data.frame (A = c (1, 1))
a2 <- data.frame (A = c (1, 2))
# > merge (a1, a1)
# A
# 1 1
# 2 1
# 3 1
# 4 1
# > merge (a2, a2)
# A
# 1 1
# 2 2
The latter is the result that I would expect, and want, in both cases. I also tried having more than one column, as well as characters instead of numerals, and the results are the same: multiple values result in merging, one unique value results in rbinding.
In the first case each row matches two rows so there are 2x2=4 rows in the output and in the second case each row matches one row so there are 2 rows in the output.
To match on row number use this:
merge(a1, a1, by = 0)
## Row.names A.x A.y
## 1 1 1 1
## 2 2 1 1
or match on row number and only return the left instance:
library(sqldf)
sqldf("select x.* from a1 x left join a1 y on x.rowid = y.rowid")
## A
## 1 1
## 2 1
or match on row number and return both instances:
sqldf("select x.A A1, y.A A2 from a1 x left join a1 y on x.rowid = y.rowid")
## A1 A2
## 1 1 1
## 2 1 1
The behaviour is detailed in the documentation but, basically, merge() will, by default, want to give you a data.frame with columns taken from both original dfs. It is going to merge rows of the two by unique values of all common columns.
df1 <- data.frame(a = 1:3, b = letters[1:3])
df2 <- data.frame(a = 1:5, c = LETTERS[1:5])
df1
a b
1 1 a
2 2 b
3 3 c
df2
a c
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
merge(df1, df2)
a b c
1 1 a A
2 2 b B
3 3 c C
What's happening in your first example is that merge() wants to combine the rows of your two data frames by the A column but because both rows in both dfs are the same, it can't figure out which row to merge with which so it creates all possible combinations.
In your second example, you don't have this problem and so merging is unambiguous. The 1 rows will get merged together as will the 2 rows.
The scenarios are more apparent when you have multiple columns in your dfs:
Case 1:
> df1 <- data.frame(a = c(1, 1), b = letters[1:2])
> df2 <- data.frame(a = c(1, 1), c = LETTERS[1:2])
> df1
a b
1 1 a
2 1 b
> df2
a c
1 1 A
2 1 B
> merge(df1, df2)
a b c
1 1 a A
2 1 a B
3 1 b A
4 1 b B
Case 2:
> df1 <- data.frame(a = c(1, 2), b = letters[1:2])
> df2 <- data.frame(a = c(1, 2), c = LETTERS[1:2])
> df1
a b
1 1 a
2 2 b
> df2
a c
1 1 A
2 2 B
> merge(df1, df2)
a b c
1 1 a A
2 2 b B

In R: Split a character vector to find specific characters and return a data frame

I want to be able to extract specific characters from a character vector in a data frame and return a new data frame. The information I want to extract is auditors remark on a specific company's income and balance sheet. My problem is that the auditors remarks are stored in vectors containing the different remarks. For instance:
vec = c("A C G H D E"). Since "A" %in% vec won't return TRUE, I have to use strsplit to break up each character vector in the data frame, hence "A" %in% unlist(strsplit(dat[i, 2], " "). This returns TRUE.
Here is a MWE:
dat <- data.frame(orgnr = c(1, 2, 3, 4), rat = as.character(c("A B C")))
dat$rat <- as.character(dat$rat)
dat[2, 2] <- as.character(c("A F H L H"))
dat[3, 2] <- as.character(c("H X L O"))
dat[4, 2] <- as.character(c("X Y Z A B C"))
Now, to extract information about every single letter in the rat coloumn, I've tried several approaches, following similar problems such as Roland's answer to a similar question (How to split a character vector into data frame?)
DF <- data.frame(do.call(rbind, strsplit(dat$rat, " ", fixed = TRUE)))
DF
X1 X2 X3 X4 X5 X6
1 A B C A B C
2 A F H L H A
3 H X L O H X
4 X Y Z A B C
This returnsthe following error message: Warning message:
In (function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 2)
It would be a desirable approach since it's fast, but I can't use DF since it recycles.
Is there a way to insert NA instead of the recycling because of the different length of the vectors?
So far I've found a solution to the problem by using for-loops in combination with ifelse-statements. However, with 3 mill obs. this approach takes years!
dat$A <- 0
for(i in seq(1, nrow(dat), 1)) {
print(i)
dat[i, 3] <- ifelse("A" %in% unlist(strsplit(dat[i, 2], " ")), 1, 0)
}
dat$B <- 0
for(i in seq(1, nrow(dat), 1)) {
print(i)
dat[i, 4] <- ifelse("B" %in% unlist(strsplit(dat[i, 2], " ")), 1, 0)
}
This gives the results I want:
dat
orgnr rat A B
1 1 A B C 1 1
2 2 A F H L H 1 0
3 3 H X L O 0 0
4 4 X Y Z A B C 1 1
I've searched through most of the relevant questions I could find here on StackOverflow. This one is really close to my problem: How to convert a list consisting of vector of different lengths to a usable data frame in R?, but I don't know how to implement strsplit with that approach.
We can use for-loop with grepl to achieve this task. + 0 is to convert the column form TRUE or FALSE to 1 or 0
for (col in c("A", "B")){
dat[[col]] <- grepl(col, dat$rat) + 0
}
dat
# orgnr rat A B
# 1 1 A B C 1 1
# 2 2 A F H L H 1 0
# 3 3 H X L O 0 0
# 4 4 X Y Z A B C 1 1
If performance is an issue, try this data.table approach.
library(data.table)
# Convert to data.table
setDT(dat)
# Create a helper function
dummy_fun <- function(col, vec){
grepl(col, vec) + 0
}
# Apply the function to A and B
dat[, c("A", "B") := lapply(c("A", "B"), dummy_fun, vec = rat)]
dat
# orgnr rat A B
# 1: 1 A B C 1 1
# 2: 2 A F H L H 1 0
# 3: 3 H X L O 0 0
# 4: 4 X Y Z A B C 1 1
using Base R:
a=strsplit(dat$rat," ")
b=data.frame(x=rep(dat$orgnr,lengths(a)),y=unlist(a),z=1)
cbind(dat,as.data.frame.matrix(xtabs(z~x+y,b)))
orgnr rat A B C F H L O X Y Z
1 1 A B C 1 1 1 0 0 0 0 0 0 0
2 2 A F H L H 1 0 0 1 2 1 0 0 0 0
3 3 H X L O 0 0 0 0 1 1 1 1 0 0
4 4 X Y Z A B C 1 1 1 0 0 0 0 1 1 1
From here you can Just call those columns that you want:
d=as.data.frame.matrix(xtabs(z~x+y,b))
cbind(dat,d[c("A","B")])
orgnr rat A B
1 1 A B C 1 1
2 2 A F H L H 1 0
3 3 H X L O 0 0
4 4 X Y Z A B C 1 1

Convert a matrix to columns

Assuming I have a matrix looks like below, the values up or down the diagonal are the same. In other words, [,1] x [2,] and [,2] x [1,] both are 2 in the matrix.
> m = cbind(c(1,2,3),c(2,4,5),c(3,5,6))
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 4 5
[3,] 3 5 6
Then I have real name for 1, 2, and 3 as well.
>Real_name
A B C # A represents 1, B represents 2, and C represents 3.
If I would like to convert the matrix to 3 columns containing corresponding real name for each pair, and the pair must be unique, A x B is the same as B x A, so we keep A x B only. How can I achieve it using R?
A A 1
A B 2
A C 3
B B 4
B C 5
C C 6
The following is straightforward:
m <- cbind(c(1,2,3), c(2,4,5), c(3,5,6))
## read `?lower.tri` and try `v <- lower.tri(m, diag = TRUE)` to see what `v` is
## read `?which` and try `which(v, arr.ind = TRUE)` to see what it gives
ij <- which(lower.tri(m, diag = TRUE), arr.ind = TRUE)
Real_name <- LETTERS[1:3]
data.frame(row = Real_name[ij[, 1]], col = Real_name[ij[, 2]], val = c(m[ij]))
# row col val
#1 A A 1
#2 B A 2
#3 C A 3
#4 B B 4
#5 C B 5
#6 C C 6
colnames(m) <- c("A", "B", "C")
rownames(m) <- c("A", "B", "C")
m[lower.tri(m)] = NA # replace lower triangular elements with NA
data.table::melt(m, na.rm = TRUE) # melt and remove NA
# Var1 Var2 value
#1 A A 1
#4 A B 2
#5 B B 4
#7 A C 3
#8 B C 5
#9 C C 6
Or you can do it in a single line: melt(replace(m, lower.tri(m), NA), na.rm = TRUE)
This will also work:
g <- expand.grid(1:ncol(m), 1:ncol(m))
g <- g[g[,2]>=g[,1],]
cbind.data.frame(sapply(g, function(x) Real_name[x]), Val=m[as.matrix(g)])
Var1 Var2 Val
1 A A 1
2 A B 2
3 B B 4
4 A C 3
5 B C 5
6 C C 6

Replacing header in data frame based on values in second data frame

Say I have a data frame which looks like this:
df.A
A B C
x 1 3 4
y 5 4 6
z 8 9 1
And I want to replace the column names in the first based on column values in a second:
df.B
Low High
A D
B F
C G
Such that I get:
df.A
D F G
x 1 3 4
y 5 4 6
z 8 9 1
How would I do it?
I have tried extracting the vector df.B$High from df.B and using this in names(df.A), but everything is in alphabetical order and shifted over one. Furthermore, this only works if the order of columns in df.A is conserved with respect to the elements in df.B$High, which is not always the case (and in my real example there is no numeric or alphabetical way to sort the two to the same order). So I think I need an rbind-type argument for matching elements, but I'm not sure.
Thanks!
You can use rename from plyr:
library(plyr)
dat <- read.table(text = " A B C
x 1 3 4
y 5 4 6
z 8 9 1",header = TRUE,sep = "")
> new <- read.table(text = "Low High
A D
B F
C G",header = TRUE,sep = "")
> rename(dat,replace = setNames(new$High,new$Low))
D F G
x 1 3 4
y 5 4 6
z 8 9 1
using match:
df.A <- read.table(sep=" ", header=T, text="
A B C
x 1 3 4
y 5 4 6
z 8 9 1")
df.B <- read.table(sep=" ", header=T, text="
Low High
A D
B F
C G")
df.C <- df.A
names(df.C) <- df.B$High[match(names(df.A), df.B$Low)]
df.C
# D F G
# x 1 3 4
# y 5 4 6
# z 8 9 1
You can play games with the row names of df.B to make a lookup more convenient:
rownames(df.B) <- df.B$Low
names(df.A) <- df.B[names(df.A),"High"]
df.A
## D F G
## x 1 3 4
## y 5 4 6
## z 8 9 1
Here's an approach abusing factor:
f <- factor(names(df.A), levels=df.B$Low)
levels(f) <- df.B$High
f
## [1] D F G
## Levels: D F G
names(df.A) <- f
## Desired results

Resources