Recognize/differentiate two sentences in R

Here is an example of my data.
Table 1: User table
id  address
1   mont carlo road,CA
2   mont road,IS
3   mont carlo road1-11,CA
Table 2: The output I want to get (similarity matrix)
id  1  2  3
1
2   3
3   1  3
Scale: 1 to 3, from very similar to very dissimilar.
My problem is how to measure the similarity between cases by address in Table 1, and then output the result as a similarity matrix like Table 2, in R. The point is how to compare two sentences in R, set a scale to measure the similarity of each pair, and finally output a matrix.

I'd also use the stringdist package but would make use of outer and cut to finish the job:
library(stringdist)
dat <- data.frame(
address = c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA"),
id = 1:3
)
m <- outer(dat[["address"]], dat[["address"]], stringdist, method="jw")
m[lower.tri(m)] <- cut(m[lower.tri(m)], 3, labels=1:3)
m[upper.tri(m)] <- cut(m[upper.tri(m)], 3, labels=1:3)
dimnames(m) <- list(dat[["id"]], dat[["id"]])
diag(m) <- NA
m
## 1 2 3
## 1 NA 3 1
## 2 3 NA 3
## 3 1 3 NA
You can use whatever method you want for calculating distance (?stringdist).

You might be interested in the Levenshtein Distance implemented in the R package stringdist. For example:
library(stringdist)
address <- c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA")
stringdist(address[1], address[2], method="lv")
[1] 8
You could then tailor these results to a matrix or whatever output you desire.
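One way to go from a single pairwise distance to a full matrix, sketched here with base R's adist() (from utils; it computes a generalized Levenshtein distance with unit costs by default) instead of stringdist. The dimnames are added only for readability and are not part of the original answer:

```r
address <- c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA")

# adist() called with one character vector compares every string with
# every other one and returns a square matrix of edit distances.
lev <- adist(address)

# Label rows and columns with the ids 1..3 (an assumption for display only).
dimnames(lev) <- list(seq_along(address), seq_along(address))
lev
```

The resulting matrix could then be binned with cut(), as in the other answer, if a 1 to 3 scale is wanted.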


R for loop replaces previously assigned values

I'd like to use a uniform distribution to randomly assign the value 1 or 2 in five groups (that is, generate 5 random uniform samples), with each group containing 10 samples.
I tried:
for(i in 1:5){
rf <- runif(10)
result[rf<=0.5]=1
result[rf>0.5]=2
}
However, this overwrites the previously assigned values on every iteration of the loop.
The code produces only 10 results:
1 2 1 2 2 1 1 1 2 1
But I want a total of 50 randomized values:
1 2 1 2 ...... 2 1 1
How can I do this? Thank you.
Since you draw the random numbers from the same distribution every time, it is simpler to generate all 50 numbers at once and assign the values with the ifelse function.
Try this:
a <- ifelse(runif(50) <= 0.5, 1, 2)
dim(a) <- c(10,5) # if you want the result as a matrix
To add to Gregor Thomas' advice, sample... You can also convert the stream into a matrix of 5 columns (groups) of 10.
nums <- sample(1:2, 50, replace = TRUE)
groups <- matrix(nums, ncol = 5)
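If the loop itself has to stay, a minimal sketch of one way to stop it from overwriting earlier values is to preallocate a 10 x 5 matrix and fill one column per iteration. The seed is arbitrary and only there for reproducibility:

```r
set.seed(1)  # arbitrary seed, an assumption for reproducibility

# One column per group, one row per sample.
result <- matrix(NA_integer_, nrow = 10, ncol = 5)

for (i in seq_len(5)) {
  rf <- runif(10)
  # Write into column i only, so earlier groups are left untouched.
  result[, i] <- ifelse(rf <= 0.5, 1L, 2L)
}

result
```

Calling as.vector(result) flattens the matrix back into the 50 values in group order.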

How to find the index of the value sampled?

In R, I would like to know how I can find the index/indices of the value(s) sampled, for example using the function sample.
In Matlab, it appears this is quite easily done by requesting the output argument idx from the function datasample. Explicitly, taken from Matlab's documentation page for datasample:
[y,idx] = datasample(data,k,...) returns an index vector indicating
which values datasample sampled from data.
I would like to know if such a thing can be accomplished in R, and how.
Example:
set.seed(12)
sample(c(0.3,78,45,0.8,0.3,0.8,77), size=1, replace=TRUE)
0.3
How can I know which of the two 0.3's was that one?
We can create a named vector and then sample:
v1 <- c(LETTERS[1:10], LETTERS[1])
names(v1) <- seq_along(v1)
v2 <- sample(v1, 20, replace=TRUE)
as.integer(names(v2))
#[1] 10 11 4 2 1 4 6 9 1 1 2 9 2 2 2 3 4 7 3 6
Using the OP's data:
v1 <- c(0.3,78,45,0.8,0.3,0.8,77)
names(v1) <- seq_along(v1)
set.seed(12)
sample(v1, size=1, replace=TRUE)
# 1
#0.3
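An alternative sketch, not taken from the answer above: sample the positions instead of the values and then index the vector with them, which gives both the index vector and the sampled values. The names x, idx and y are illustrative:

```r
set.seed(12)
x <- c(0.3, 78, 45, 0.8, 0.3, 0.8, 77)

# Draw positions rather than values ...
idx <- sample(seq_along(x), size = 3, replace = TRUE)

# ... then look the values up; idx records which element each draw came from.
y <- x[idx]

idx
y
```

This mirrors the [y, idx] pair returned by Matlab's datasample, and it tells the two 0.3 entries apart because they sit at different positions.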

Resample with replacement by cluster

I want to draw clusters (defined by the variable id) with replacement from a dataset, and in contrast to previously answered questions, I want clusters that are chosen K times to have each observation repeated K times. That is, I'm doing cluster bootstrapping.
For example, the following samples id=1 twice, but repeats the observations for id=1 only once in the new dataset s. I want all observations from id=1 to appear twice.
f <- data.frame(id=c(1, 1, 2, 2, 2, 3, 3), X=rnorm(7))
set.seed(451)
new.ids <- sample(unique(f$id), replace=TRUE)
s <- f[f$id %in% new.ids, ]
One option would be to lapply over each new.id and save it in a list. Then you can stack that all together:
library(data.table)
rbindlist(lapply(new.ids, function(x) f[f$id %in% x,]))
# id X
#1: 1 1.20118333
#2: 1 -0.01280538
#3: 1 1.20118333
#4: 1 -0.01280538
#5: 3 -0.07302158
#6: 3 -1.26409125
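The same idea can be sketched in base R without data.table, using do.call(rbind, ...) to stack the pieces. Here the seed is set before rnorm(), which is an assumption made for reproducibility and differs from the question's code, so the drawn ids may differ from the output above:

```r
set.seed(451)
f <- data.frame(id = c(1, 1, 2, 2, 2, 3, 3), X = rnorm(7))
new.ids <- sample(unique(f$id), replace = TRUE)

# One data-frame slice per drawn id; an id drawn K times contributes K slices.
pieces <- lapply(new.ids, function(x) f[f$id == x, ])

# Stack the slices and reset the row names.
s <- do.call(rbind, pieces)
rownames(s) <- NULL
s
```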
In case you need a "new_id" that corresponds to the index number (i.e. the sample order): I needed "new_id" so that I could run mixed-effects models without several instances of a cluster being treated as one cluster because they shared the same id:
library(data.table)
f = data.frame( id=c(1,1,2,2,2,3,3), X = rnorm(7) )
set.seed(451); new.ids = sample( unique(f$id), replace=TRUE )
## ss has unique valued `new_id` for each cluster
ss = rbindlist(mapply(function(x, index) cbind(f[f$id %in% x,], new_id=index),
new.ids,
seq_along(new.ids),
SIMPLIFY=FALSE
))
ss
which gives:
> ss
id X new_id
1: 1 -0.3491670 1
2: 1 1.3676636 1
3: 1 -0.3491670 2
4: 1 1.3676636 2
5: 3 0.9051575 3
6: 3 -0.5082386 3
Note that the values of X are different because set.seed is not called before the rnorm() call, but the ids are the same as in @Mike H.'s answer.
This link was useful to me in constructing this answer: R lapply statement with index [duplicate]

Creating combination of sequences

I am trying to solve following problem:
Consider 5 simple sequences: 0:100, 100:0, rep(0,101), rep(50,101), rep(100,101)
I need sets of 3 numeric variables, which have above sequences in all combinations. Since there are 5 sequences and 3 variables, there can be 5*5*5 combinations, hence total of 12625 (5*5*5*101) numbers in each variable (101 for each sequence).
These can be grouped in a data.frame of 12625 rows and 4 columns. First column (V) will simply have seq(1:12625) (rownumbers can be used in its place). Other 3 columns (A,B,C) will have above 5 sequences in different combinations. For example, the first 101 rows will have 0:100 in all 3 A,B and C. Next 101 rows will have 0:100 in A and B, and 100:0 in C. And so on...
I can create sequences as:
s = list()
s[[1]] = 0:100
s[[2]] = 100:0
s[[3]] = rep(0,101)
s[[4]] = rep(50,101)
s[[5]] = rep(100,101)
But how to proceed further? I do not really need the data frame but I need a function that returns a list containing the values of c(A,B,C) for the number (first or V column) sent to it. The number can obviously vary from 1 to 12625.
How can I create such a function? I would prefer a vectorized solution or one using apply-family functions, for speed.
You asked for a vectorized solution, so here's one using only data.table (similar to @SimonG's methodology):
library(data.table)
grd <- CJ(A = seq_len(5), B = seq_len(5), C = seq_len(5))
res <- grd[, lapply(.SD, function(x) unlist(s[x]))]
res
# A B C
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 2
# 4: 3 3 3
# 5: 4 4 4
# ---
# 12621: 100 100 100
# 12622: 100 100 100
# 12623: 100 100 100
# 12624: 100 100 100
# 12625: 100 100 100
I came up with two solutions. I find this hard to do with apply and the likes since they tend to give an output that is not so nice to handle (maybe someone can "tame" them better than I can :D)
The first solution uses separate calls to lapply; the second uses a for loop and some programming no-nos. Personally I prefer the second one, though the first is faster.
grd <- expand.grid(a=1:5,b=1:5,c=1:5)
# apply-ish
A <- lapply(grd[,1], function(z){ s[[z]] })
B <- lapply(grd[,2], function(z){ s[[z]] })
C <- lapply(grd[,3], function(z){ s[[z]] })
dfr <- data.frame(A=do.call(c,A), B=do.call(c,B), C=do.call(c,C))
# for-ish
mat <- NULL
for(i in 1:nrow(grd)){
cur <- grd[i,]
tmp <- cbind(s[[cur[,1]]],s[[cur[,2]]],s[[cur[,3]]])
mat <- rbind(mat,tmp)
}
The output of both dfr and mat seem to be what you describe.
Cheers!
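The question also asks for a function that returns c(A, B, C) for a given row number without building the whole table. A minimal sketch using integer arithmetic, assuming the ordering described in the question (C changes fastest, then B, then A); get_abc is a hypothetical name, not something from the answers above:

```r
s <- list(0:100, 100:0, rep(0, 101), rep(50, 101), rep(100, 101))

# get_abc() is a hypothetical helper; v is a single row number in 1..12625.
get_abc <- function(v) {
  block <- (v - 1) %/% 101      # which of the 125 combinations (0-based)
  pos   <- (v - 1) %% 101 + 1   # position inside the 101-long sequence
  a <- block %/% 25 + 1         # sequence index for A (changes slowest)
  b <- (block %/% 5) %% 5 + 1   # sequence index for B
  k <- block %% 5 + 1           # sequence index for C (changes fastest)
  c(A = s[[a]][pos], B = s[[b]][pos], C = s[[k]][pos])
}

get_abc(1)
get_abc(102)
```

As written it takes one row number at a time; for several rows, something like t(sapply(c(1, 102, 12625), get_abc)) would collect the results into a matrix.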

Returning both halves of a distance matrix in R

I have an external distance object (gdis) and need to convert it to a matrix that contains both halves of the distance matrix. I am currently using the print function to achieve this. It is rather clunky having to print out the entire matrix each time I run the script. Is there any way to avoid this?
wgdis<-print(gdis, upper=TRUE)
m <- matrix(rpois(50,5), nrow=5)
m2 <- dist(m)
attr(m2, "Upper") <- TRUE
m2
          1         2         3         4         5
1           12.083046 11.135529 11.958261 12.529964
2 12.083046            6.928203 10.148892  8.062258
3 11.135529  6.928203            8.660254  9.643651
4 11.958261 10.148892  8.660254           14.491377
5 12.529964  8.062258  9.643651 14.491377
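Setting the "Upper" attribute only changes how the object prints. If an actual matrix is needed, a minimal sketch is to call as.matrix(), assuming gdis is of class "dist" (here a stand-in is built with dist(), and the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed, an assumption for reproducibility
m <- matrix(rpois(50, 5), nrow = 5)
gdis <- dist(m)  # stand-in for the external distance object

# as.matrix() expands a "dist" object into a full symmetric matrix
# (zeros on the diagonal) without printing anything.
wgdis <- as.matrix(gdis)
dim(wgdis)
```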
