Jaro-Winkler difference between packages - R

I am using fuzzy matching to clean up medication data input by users, and I am using the Jaro-Winkler distance. I was testing which package computes the Jaro-Winkler distance faster when I noticed that the default settings do not give identical values. Can anyone help me understand where the difference comes from? Example:
library(RecordLinkage)
library(stringdist)
jarowinkler("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), method = "jw")
# [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667
I am assuming it has to do with the weights, and I know I am using the defaults on both. However, if someone with more experience could shed light on what's going on, I would really appreciate it. Thanks!
Documentation:
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf

Tucked away in the documentation for stringdist is the following:
The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d − l · p · d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a penalty factor, which in the work of Winkler is often chosen 0.1.
However, in stringdist::stringdist, p = 0 by default. Hence:
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"),
method = "jw", p = .1)
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
In fact that value is hard-coded in the source of RecordLinkage::jarowinkler.
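To see where the prefix correction enters, here is a minimal sketch (my own illustration, not package code) that applies the documented formula d - l * p * d to the plain Jaro distance for the first pair:
library(stringdist)
d <- stringdist("advil", "advi", method = "jw", p = 0)  # plain Jaro distance, no prefix bonus
l <- 4                                                  # length of the shared prefix "advi", capped at 4
p <- 0.1                                                # Winkler's usual penalty factor
d_w <- d - l * p * d                                    # Winkler-corrected distance
1 - d_w                                                 # 0.96, matching RecordLinkage::jarowinkler("advil", "advi")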

Related

How to calculate the mean of multiple standard deviations in R

I have a little problem. I have 10 standard deviations and 10 means from normal distributions like this:
N(5,1), N(10,3), N(8,2), N(6,1), N(10,3), N(7,2), N(4,1), N(10,3), N(9,2), N(8,1).
If I want the mean of the means in R, the code is
c = cbind(c(5,10,8,6,10,7,4,10,9,8))
y = mean(c)
So how do I calculate the average of the standard deviations, given that this average does not follow the usual averaging formula?
Not sure if I understand you correctly, but if you have a vector of standard deviations, you can also just calculate the mean.
So e.g.
my_sd = c(1.23, 4.53, 3.343)
mean(my_sd)
If your question is about how to calculate a standard deviation, this can be done easily with the sd function.
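For example (a trivial sketch, using the means from the question as a stand-in vector):
x <- c(5, 10, 8, 6, 10, 7, 4, 10, 9, 8)
sd(x)  # sample standard deviation of the vector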
I'm not sure what error appears in your console. Could it be that you are missing a ) at the end of your cbind call?
As you can see below, I can calculate the mean of c without issue.
> c <- cbind(c(5,10,8,6,10,7,4,10,9,8))
> y <- mean(c)
> y
[1] 7.7
I'm not sure what your objective is. But averaging the standard deviations obscures the relationships between the standard deviations and the associated means. For instance, N(25, 2) and N(5, 5) would have much different summary statistics than N(25, 5) and N(5, 2), even though the averages of the means and standard deviations would be the same. A better statistic could be the average of the Coefficients of Variation of each of the distributions. So:
ms <- c(5, 10, 8, 6, 10, 7, 4, 10, 9, 8)
sds <- c(1, 3, 2, 1, 3, 2, 1, 3, 2, 1)
cvs <- sds/ms
cvs
[1] 0.2000000 0.3000000 0.2500000 0.1666667 0.3000000 0.2857143 0.2500000 0.3000000 0.2222222 0.1250000
meancvs <- mean(cvs)
meancvs
[1] 0.2399603

Why am I getting the same result when using p.adjust for the Holm and Bonferroni methods?

I've run several regressions and I'm trying to see if any of the significant findings hold after a multiple comparison adjustment. I know that the Bonferroni adjustment is more conservative than the Holm adjustment, so I would expect the outputs to be different; however, they weren't. One of the values varied slightly, but not as much as I would have expected given that one of these tests is supposed to be less conservative than the other. I tried another correction method and got similar results. Is there something wrong with my syntax, or is this a normal result?
p.vect <- c(.003125, .008947)
p.adjust(p.vect, method = "bonferroni", n=80)
[1] 0.25000 0.71576
p.adjust(p.vect, method = "holm", n=80)
[1] 0.250000 0.706813
p.adjust(p.vect, method = "hochberg", n = 80)
[1] 0.250000 0.706813
Holm and Hochberg just don't differ from each other for length(p)==2.
Given that lp is length(na.omit(p)) (equal to 2 in this case), and p is the vector of probabilities, here's the code for method=="holm":
i <- seq_len(lp) ## (1,2)
o <- order(p) ## (1,2)
ro <- order(o) ## (1,2)
pmin(1, cummax((n + 1L - i) * p[o]))[ro] ## c(80,79)*p
and the code for method=="hochberg":
i <- lp:1L ## (2,1)
o <- order(p, decreasing = TRUE) ## (2,1)
ro <- order(o) ## (2,1)
pmin(1, cummin((n + 1L - i) * p[o]))[ro] ## c(80,79)*p[c(2,1)][c(2,1)]
If you step through the details you can see how they give the same answer for your case. (Holm uses cummax() on the sorted vector of adjusted probabilities, Hochberg uses cummin() on the reverse-sorted vector, but neither of these changes anything in this case.)
Bonferroni is pmin(1, n*p). In this case the only difference is a factor of 80/79 on the second element (Hochberg and Holm multiply by (n+1-i) = c(80,79), while Bonferroni multiplies both elements by n = 80).
You can print out the code by typing p.adjust by itself at the console.
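As a quick check (my own sketch, not part of p.adjust), the two adjustments can be reproduced by hand for this example:
p <- c(.003125, .008947)
pmin(1, 80 * p)                       # Bonferroni: every p multiplied by n = 80
# [1] 0.25000 0.71576
pmin(1, cummax(c(80, 79) * sort(p)))  # Holm: smallest p * 80, next * 79 (p is already sorted, so the order matches)
# [1] 0.250000 0.706813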

RecordLinkage package in R - add weight to individual linking variables

I'm following the excellent tutorial on RPubs which uses the magnificent RecordLinkage package. I'm applying this to my own data but I'll just use the tutorial to explain my problem.
In the two datasets for comparison there are a number of common fields used in the linkage:
patents <- patents[,c("seq", "firstname", "lastname", "city", "state", "organization")]
nsf <- nsf[, c("InvestigatorId", "FirstName", "LastName", "CityName", "StateCode", "Name")]
names(nsf) <- names(patents)
These fields are then compared using the compare.linkage() function:
a <- compare.linkage(nsf, patents, blockfld = c("state"), strcmp = T, exclude=c(1))
This creates a large RecLinkData object called 'a' that contains a bunch of comparison pairs.
The next step is calculating the M and U weights (agreement weights) using the expectation maximisation (EM) algorithm:
b <- emWeights(a, cutoff = 0.8)
I think this is basically creating an overall agreement weight which is a product of all the individual linking variables.
My question is how can I add importance for one of the individual linking variables?
So, for example, I might know that the "lastname" field is reliable and accurate in both datasets, so if the last names agree exactly, I would like that to carry more weight in the overall agreement score.
Even some pointers on where to look would be helpful; I'm a bit lost on this and don't even know what to attempt in terms of code.
You can't pass additional information to emWeights(), except perhaps cutoff =, which accepts a single value or a vector with the same length as the number of attributes. So you can choose a high cutoff value for attributes you know to be accurate, which minimizes the number of random matches.
Apart from that, the EM algorithm in RecordLinkage allows no further customization.
There is, however, the epiWeights() pendant, which calculates weights between 0 and 1 using an estimated error rate (default e = 0.01) and the average frequency of values in each field (1/length(unique(all_values_in_a_field))). You can supply both to the function manually and tweak the results this way.
Consider this example:
t1 <- data.frame(Vorname = c("Karl", "Fritz"), Name = c("Meister", "Schulz"), stringsAsFactors = F)
t2 <- data.frame(Vorname = c("Karl", "Fritz"), Name = c("Meister", "Schulze"), stringsAsFactors = F)
# 'linkage' is assumed to be the comparison object built from t1 and t2,
# e.g. with compare.linkage(t1, t2, strcmp = TRUE)
> epiWeights(linkage)$Wdata # e = 0.01
[1] 1.0000000 0.0000000 0.0000000 0.3855691
> epiWeights(linkage, e = c(0.01, 0.3))$Wdata
[1] 1.0000000 0.0000000 0.0000000 0.3120078
If you assume a higher error rate for the Name field, it gets a lower weight.
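The average frequencies can likewise be supplied through epiWeights()'s f argument; the following is only a sketch with arbitrary values, assuming linkage is the comparison object from above:
# Treating values in the Name field as rarer (lower average frequency) than in
# Vorname makes exact agreement on Name contribute more to the overall weight.
epiWeights(linkage, e = 0.01, f = c(0.5, 0.05))$Wdata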

How to exclude unwanted comparisons in two-way ANOVA in R

I have already asked about this on stats.exchange (original question); I'm re-posting the same content here, hoping to get help from a wider audience.
I would like to know how to exclude unwanted pairs from the output of a two-way ANOVA, so that when summary(aov()) shows a significant result, the post-hoc test won't give me any comparisons I don't want. Details as follows:
I have datTable containing proportion data under two factors: site (four levels: A, B, C, D) and treatment (two levels: control and treated). Specifically, I want to do pair-wise tests among all sites within the same treatment (e.g. control-A vs. control-B, control-A vs. control-C, treated-A vs. treated-C, etc.), while excluding comparisons across different sites and different treatments (e.g. pairs such as control-A vs. treated-B, control-B vs. treated-C).
The data looks like this:
> datTable
site treatment proportion
A control 0.5000000
A control 0.4444444
A treated 0.1000000
A treated 0.4000000
B control 0.4444444
B control 0.4782609
B treated 0.0500000
B treated 0.3000000
C control 0.3214286
C control 0.4705882
C treated 0.1200000
C treated 0.4000000
D control 0.3928571
D control 0.4782609
D treated 0.4000000
D treated 0.4100000
I did a two-way ANOVA (also not sure whether to use within subject site/treatment or between subject site*treatment...), and summarised the results.
m1 <- aov(proportion~site*treatment,data=datTable) # Or should I use 'site/treatment'?
Then my summary(m1) gave me the following:
> summary(m1)
Df Sum Sq Mean Sq F value Pr(>F)
site 3 0.02548 0.00849 0.513 0.6845
treatment 1 0.11395 0.11395 6.886 0.0305 *
site:treatment 3 0.03686 0.01229 0.742 0.5561
Residuals 8 0.13239 0.01655
The next step is to use the TukeyHSD post-hoc test to see which pairs actually caused the * significance.
> TukeyHSD(m1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = proportion ~ site * treatment, data = datTable)
$site
diff lwr upr p adj
B-A -0.042934783 -0.3342280 0.2483585 0.9631797
C-A -0.033106909 -0.3244002 0.2581863 0.9823452
D-A 0.059168392 -0.2321249 0.3504616 0.9124774
C-B 0.009827873 -0.2814654 0.3011211 0.9995090
D-B 0.102103175 -0.1891901 0.3933964 0.6869754
D-C 0.092275301 -0.1990179 0.3835685 0.7461309
$treatment
diff lwr upr p adj
treated-control -0.1687856 -0.3171079 -0.02046328 0.0304535
$`site:treatment`
diff lwr upr p adj
B:control-A:control -0.010869565 -0.5199109 0.4981718 1.0000000
C:control-A:control -0.076213819 -0.5852551 0.4328275 0.9979611
D:control-A:control -0.036663216 -0.5457045 0.4723781 0.9999828
A:treated-A:control -0.222222222 -0.7312635 0.2868191 0.6749021
B:treated-A:control -0.297222222 -0.8062635 0.2118191 0.3863364 # Not wanted
C:treated-A:control -0.212222222 -0.7212635 0.2968191 0.7154690 # Not wanted
D:treated-A:control -0.067222222 -0.5762635 0.4418191 0.9990671 # Not wanted
C:control-B:control -0.065344254 -0.5743856 0.4436971 0.9992203
D:control-B:control -0.025793651 -0.5348350 0.4832477 0.9999985
A:treated-B:control -0.211352657 -0.7203940 0.2976887 0.7189552 # Not wanted
B:treated-B:control -0.286352657 -0.7953940 0.2226887 0.4233804 # Not wanted
C:treated-B:control -0.201352657 -0.7103940 0.3076887 0.7583437 # Not wanted
D:treated-B:control -0.056352657 -0.5653940 0.4526887 0.9996991
D:control-C:control 0.039550603 -0.4694907 0.5485919 0.9999713
A:treated-C:control -0.146008403 -0.6550497 0.3630329 0.9304819 # Not wanted
B:treated-C:control -0.221008403 -0.7300497 0.2880329 0.6798628 # Not wanted
C:treated-C:control -0.136008403 -0.6450497 0.3730329 0.9499131
D:treated-C:control 0.008991597 -0.5000497 0.5180329 1.0000000 # Not wanted
A:treated-D:control -0.185559006 -0.6946003 0.3234823 0.8168230 # Not wanted
B:treated-D:control -0.260559006 -0.7696003 0.2484823 0.5194129 # Not wanted
C:treated-D:control -0.175559006 -0.6846003 0.3334823 0.8505865 # Not wanted
D:treated-D:control -0.030559006 -0.5396003 0.4784823 0.9999950
B:treated-A:treated -0.075000000 -0.5840413 0.4340413 0.9981528
C:treated-A:treated 0.010000000 -0.4990413 0.5190413 1.0000000
D:treated-A:treated 0.155000000 -0.3540413 0.6640413 0.9096378
C:treated-B:treated 0.085000000 -0.4240413 0.5940413 0.9960560
D:treated-B:treated 0.230000000 -0.2790413 0.7390413 0.6429921
D:treated-C:treated 0.145000000 -0.3640413 0.6540413 0.9326207
However, there are some pairs I don't want included in the two-way ANOVA I performed, marked above as # Not wanted.
Is there any way I can tweak the aov or TukeyHSD function to exclude those 'not wanted' possibilities listed above? I could easily select the significant entries I am interested in (with *) from the long list produced by TukeyHSD. But I don't want my ANOVA result to be biased by those! (In the real data, it happens that the significance is actually caused by those unwanted pairs!)
NB: You might have noticed that the site:treatment post-hoc tests don't show any significance; this is because I only selected a small sample from the original data.
If you mean to exclude those comparisons completely from the calculations: Tukey's test works by doing pairwise comparisons for all combinations of conditions, so it doesn't make sense to "exclude" any pairs from the calculation itself.
If you mean you want to exclude the unwanted comparisons from showing in your final results then yes, it is possible. The result of TukeyHSD is simply a list and site:treatment is simply a matrix which you can manipulate as you like.
lst <- TukeyHSD(m1)
lst[['site:treatment']] <- lst[['site:treatment']][-c(5,6,7,10,11,12,15,16,18,19,20,21),]
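If you would rather select rows by a rule instead of hard-coded indices, one possible sketch (my own, reusing m1 from above) keeps only the comparisons where both groups share the same treatment, which is the rule stated in the question:
tuk <- TukeyHSD(m1)
# keep rows whose names pair control with control, or treated with treated
keep <- grepl("control-.*control|treated-.*treated", rownames(tuk[["site:treatment"]]))
tuk[["site:treatment"]] <- tuk[["site:treatment"]][keep, ]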

better way to calculate euclidean distance with R

I am trying to calculate the Euclidean distance for the iris dataset. Basically I want to calculate the distance between each pair of observations. I have code that works as follows:
n <- nrow(iris)       # pairwise distances, so both loops run over observations (rows)
m <- matrix(0, n, n)  # preallocate the distance matrix
for (i in 1:n) {
  for (j in 1:n) {
    m[i, j] <- sqrt((iris[i, 1] - iris[j, 1])^2 +
                    (iris[i, 2] - iris[j, 2])^2 +
                    (iris[i, 3] - iris[j, 3])^2 +
                    (iris[i, 4] - iris[j, 4])^2)
  }
}
Although this works, I don't think this is a good way to write R-style code. I know that R has a built-in function to calculate Euclidean distance. Without using the built-in function, I would like to know better code (faster and fewer lines) that does the same as mine.
The part inside the loop can be written as
m[i, j] <- sqrt(sum((iris[i, 1:4] - iris[j, 1:4]) ^ 2))
I’d keep the nested loop, nothing wrong with that here.
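That said, if you do want to avoid the loop without calling dist(), one vectorized sketch (my own, assuming only the four numeric columns are used) relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y:
X  <- as.matrix(iris[, 1:4])
sq <- rowSums(X^2)                                        # squared norms of each row
m  <- sqrt(pmax(outer(sq, sq, "+") - 2 * X %*% t(X), 0))  # pmax guards against tiny negative rounding errors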
Or stay with the standard package stats:
m <- dist(iris[, 1:4])
This gives you an object of class dist, which stores the lower triangle (all you need) compactly. You can get an ordinary full symmetric matrix if, e.g., you want to look at some elements:
> as.matrix(m)[1:5,1:5]
1 2 3 4 5
1 0.0000000 0.5385165 0.509902 0.6480741 0.1414214
2 0.5385165 0.0000000 0.300000 0.3316625 0.6082763
3 0.5099020 0.3000000 0.000000 0.2449490 0.5099020
4 0.6480741 0.3316625 0.244949 0.0000000 0.6480741
5 0.1414214 0.6082763 0.509902 0.6480741 0.0000000
