If I have a data frame in R with two UK postcode fields (both in upper case), is there an easy, convenient way to define a score that is based on some kind of fuzzy comparison between these two fields?
I have done some googling, but everything I found was some kind of "fuzzy join", and I don't need the join part here.
So for example if I had:
my_df <- data.frame(postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ"))
then I might like to do something like
my_df <- my_df |>
mutate(score = fuzzy_string_compare_thingy(postcode_1, postcode_2))
to give me (for example):
my_df <- data.frame(postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ"),
score = c(1, 0.1, 0.9, 0.7))
(the values in the score field are made up of course)
Choose your desired distance method using e.g. stringdist
library(stringdist)
cbind(my_df,
sapply(c("osa", "lv", "dl", "hamming", "lcs", "qgram",
"cosine", "jaccard", "jw", "soundex" ), function(m)
apply(my_df, 1, function(x) stringdist(x[1], x[2], method=m))))
postcode_1 postcode_2 osa lv dl hamming lcs qgram cosine jaccard
1 AB1 1AB AB1 1AB 0 0 0 0 0 0 0.0000000 0.0000000
2 DN17 2DF EC1X 3WW 6 6 6 6 12 12 0.8000000 0.8333333
3 TN38 8LE TN38 9LE 1 1 1 1 2 2 0.1055728 0.1250000
4 FK1 2ZZ FK2 1ZZ 2 2 2 2 4 0 0.0000000 0.0000000
jw soundex
1 0.00000000 0
2 0.50000000 1
3 0.08333333 0
4 0.04761905 0
Using "jw" and rescaling so that a score of 1 means the two strings are identical, with dplyr:
library(dplyr)
my_df %>%
mutate(score_1 = 1 - stringdist(postcode_1, postcode_2, method="jw"))
postcode_1 postcode_2 score_1
1 AB1 1AB AB1 1AB 1.0000000
2 DN17 2DF EC1X 3WW 0.5000000
3 TN38 8LE TN38 9LE 0.9166667
4 FK1 2ZZ FK2 1ZZ 0.9523810
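If installing stringdist is not an option, a similar score can be sketched in base R with adist() (Levenshtein distance) from the utils package. This swaps in a different metric from "jw", so the numbers will not match the table above; similarity_lv is a hypothetical helper name, and dividing by the longer string's length is just one possible way to rescale to the 0 to 1 range.

```r
# Hypothetical helper: similarity in [0, 1] derived from Levenshtein distance.
# Assumes non-empty strings (two empty strings would give 0/0 = NaN).
similarity_lv <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  # adist() returns a full distance matrix, so compare pair by pair.
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  # Rescale by the longer string so that identical strings score 1.
  1 - d / pmax(nchar(a), nchar(b))
}

my_df <- data.frame(
  postcode_1 = c("AB1 1AB", "DN17 2DF", "TN38 8LE", "FK1 2ZZ"),
  postcode_2 = c("AB1 1AB", "EC1X 3WW", "TN38 9LE", "FK2 1ZZ")
)
my_df$score_lv <- similarity_lv(my_df$postcode_1, my_df$postcode_2)
my_df
```

Because the function is vectorised over both arguments, it can also be dropped into mutate() in place of fuzzy_string_compare_thingy.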
I have this dataset (let's imagine it with 900 variables) and the list l2, as follows:
df = data.frame(x = c(1,0,0,0,1,1,1), y = c(2,2,2,2,3,3,2) )
l1 = lapply(df,table)
l2 = lapply(l1,as.data.frame)
I wish to add a percentage column to each of these dataframes based on the Freq column of each dataframe. Appreciate the help.
You can use proportions:
lapply(df, \(x) transform(as.data.frame(table(x)), prop = proportions(Freq)))
$x
x Freq prop
1 0 3 0.4285714
2 1 4 0.5714286
$y
x Freq prop
1 2 5 0.7142857
2 3 2 0.2857143
We may use
proportions(table(data.frame(nm1 = names(df)[col(df)], val = unlist(df))), 1)
-output
val
nm1 0 1 2 3
x 0.4285714 0.5714286 0.0000000 0.0000000
y 0.0000000 0.0000000 0.7142857 0.2857143
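If the goal is literally to add a column to each data frame already stored in l2, a minimal base R sketch could look like the following. The column name pct is an assumption, and the values are scaled to 0 to 100 rather than 0 to 1.

```r
df <- data.frame(x = c(1, 0, 0, 0, 1, 1, 1), y = c(2, 2, 2, 2, 3, 3, 2))
l1 <- lapply(df, table)
l2 <- lapply(l1, as.data.frame)

# Add a percentage column (hypothetical name 'pct') to every data frame in l2,
# computed from that data frame's own Freq column.
l2 <- lapply(l2, function(d) transform(d, pct = 100 * Freq / sum(Freq)))
l2$x
```

Since lapply() works element by element, the same call should carry over unchanged to a list built from 900 variables.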
I have a matrix that is 10 rows by 4 columns. Each row represents a user, and each column a measurement. Some users only have one measurement, while others may have the full 4 measurements.
The goals I want to accomplish with this matrix are threefold:
To subtract the user's measurements from their own measurements (across columns);
To subtract the user's measurement from other user's measurement points (all included, across rows);
To create a final matrix that counts the number of "matches" (comparisons) each user has against themselves and others.
Within a threshold of 2.0 units, I have tried to measure each user's measurement against their own measurement and other users by obtaining the difference with a nested for-loop.
Below is an example of what the clean_data matrix looks like, and this matrix was used for all three goals:
M1 M2 M3 M4
U1 148.2 148.4 155.6 155.7
U2 149.5 150.1 150.1 153.9
U3 148.4 154.2 NA NA
U4 154.5 NA NA NA
U5 151.1 156.9 157.1 NA
For Goal #3, the output should look something akin to this matrix:
U1 U2 U3 U4 U5
U1 2 8 4 2 3
U2 8 3 2 1 4
U3 4 2 0 1 0
U4 2 1 1 0 0
U5 3 4 0 0 1
For example: User 1 has 2 matches with themselves because, across all 4 of their measurements, 2 pairwise differences were less than 2.0 units. User 1 also has 8 matches with User 2. Each of User 1's measurements was subtracted in turn from each of User 2's measurements (stored as an absolute value), and those differences that were below 2 were considered a "match."
I have tried using the following nested for-loop, however I believe it is only counting the number of elements in my matrix instead of adding the differences.
# Set the time_threshold.
time_threshold <- 2.000
# Create an empty matrix the same dimensions as the number of users present.
matrix_a<-matrix(nrow = nrow(clean_data), ncol = nrow(clean_data))
# Use a nested for-loop to calculate the intra-user
# and inter-user time differences, adding values below
# the threshold up for those user-comparisons.
for (i in 1:nrow(clean_data)) {
for (j in 1:nrow(clean_data)) {
matrix_a[i, j] <-
round(sum(!is.na(abs((clean_data[i, 2:dim(clean_data)[2]]) -
(clean_data[j, 2:dim(clean_data)[2]])
) <= time_threshold)) / 2)
}
}
# Dividing by 2 and rounding has proven that this code only counts the
# number of vectors that are not NA, not the values below by time_threshold (2.000).
Is there a way that can calculate the differences I outlined above, and is also more efficient than a nested for-loop?
Note: The structure of these data is only relevant insofar as differences can be calculated for individuals across rows and columns. Missing values in this example are represented as NA and should not be included in the calculation. I have also tried setting them to -0.01, which did not change the outcome of my for-loop.
You could write a function to do the loop for you:
fun <- function(index, dat){
i <- index[1]
j <- index[2]
m <- if(i==j) combn(dat[i,],2, function(x)diff(x))
else do.call("-", expand.grid(dat[i, ], dat[j, ]))
sum(abs(m)<2, na.rm = TRUE)
}
dist_fun <- function(dat){
dat <- as.matrix(dat)
result <- diag(0, nrow(dat))
mat_index <- which(lower.tri(result, TRUE), TRUE)
result[mat_index] <- apply(mat_index, 1, fun, dat = dat)
result[mat_index[,2:1]] <- result[mat_index]
result
}
dist_fun(df)
[,1] [,2] [,3] [,4] [,5]
[1,] 2 8 4 2 4
[2,] 8 3 4 1 3
[3,] 4 4 0 1 0
[4,] 2 1 1 0 0
[5,] 4 3 0 0 1
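The same counting idea can also be sketched with outer(), which builds every pairwise difference between two users in one call. This is only a sketch under some assumptions: count_matches is a hypothetical name, the strict "less than" comparison follows the wording of the question, and the loop visits each user pair once and mirrors the result.

```r
clean_data <- matrix(
  c(148.2, 148.4, 155.6, 155.7,
    149.5, 150.1, 150.1, 153.9,
    148.4, 154.2, NA,    NA,
    154.5, NA,    NA,    NA,
    151.1, 156.9, 157.1, NA),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("U", 1:5), paste0("M", 1:4))
)

# Hypothetical helper: count measurement pairs closer than 'threshold'.
count_matches <- function(m, threshold = 2) {
  n <- nrow(m)
  res <- matrix(0L, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in i:n) {
      # All pairwise absolute differences between user i and user j.
      d <- abs(outer(m[i, ], m[j, ], "-")) < threshold
      # Within one user, count each pair once and skip self-comparisons.
      if (i == j) d <- d[upper.tri(d)]
      # NA comparisons are dropped rather than counted.
      res[i, j] <- res[j, i] <- sum(d, na.rm = TRUE)
    }
  }
  res
}

count_matches(clean_data)
```

The key difference from the loop in the question is sum(..., na.rm = TRUE) on the logical comparison itself, instead of sum(!is.na(...)), which only counts how many comparisons were not missing.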
Here's one tidyverse approach. I convert the data to longer format, then join it to itself by User (across) and by time point (down), each time counting the number of matches. Then I combine the two and convert to wide format again.
library(tidyverse)
my_data2 <- my_data %>% pivot_longer(-User)
left_join(my_data2, my_data2, by = "User") %>%
filter(name.x < name.y, abs(value.y - value.x) <= 2) %>% # EDIT
count(User) %>%
select(User.x = User, User.y = User, n) -> compare_across
my_data3 <- my_data2 %>% mutate(dummy = 1) # EDIT
inner_join(my_data3, my_data3, by = "dummy") %>% # EDIT
filter(abs(value.x - value.y) <=2, User.x != User.y) %>%
count(User.x, User.y) -> compare_down
bind_rows(compare_across, compare_down) %>%
arrange(User.x, User.y) %>%
pivot_wider(names_from = User.y, values_from = n, values_fill = list(n = 0))
# A tibble: 5 x 6
User.x U1 U2 U3 U4 U5
<chr> <int> <int> <int> <int> <int>
1 U1 2 8 4 2 4
2 U2 8 3 4 1 3
3 U3 4 4 0 1 0
4 U4 2 1 1 0 0
5 U5 4 3 0 0 1
source data:
my_data <- data.frame(
stringsAsFactors = FALSE,
User = c("U1", "U2", "U3", "U4", "U5"),
M1 = c(148.2, 149.5, 148.4, 154.5, 151.1),
M2 = c(148.4, 150.1, 154.2, NA, 156.9),
M3 = c(155.6, 150.1, NA, NA, 157.1),
M4 = c(155.7, 153.9, NA, NA, NA)
)
I've been breaking my head all morning over how to do this.
So let's say this is my data set
set.seed(1)
temp <- as.data.frame(cbind(Key = letters[1:5], sapply(1:12, function(x) sample(c(0, 1), 5, replace = T))))
names(temp)[2:13] <- month.abb
temp
# Key Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1 a 0 1 0 0 1 0 0 1 1 1 0 0
# 2 b 0 1 0 1 0 0 1 1 1 0 1 0
# 3 c 1 1 1 1 1 0 0 0 1 0 0 1
# 4 d 1 1 0 0 0 1 0 1 1 1 0 1
# 5 e 0 0 1 1 0 0 1 0 1 1 0 0
What I'm trying to do is to calculate the percentage of occurrences (1s) in two consecutive months.
For example, c and d had an occurrence in Jan. Both had an occurrence in Feb too, so the output will be 1 for this month. In Feb, a-d had occurrences but only c also had an occurrence in Mar, so the output will be .25 for that month, etc.
Desired output for that mini example:
data.frame(Month = month.abb[1:11], OverlapPercent = c(1, 1/4, 1, 1/3, 0, 0, 1/2, 1, 3/5, 0, 0))
# Month OverlapPercent
# 1 Jan 1.0000000
# 2 Feb 0.2500000
# 3 Mar 1.0000000
# 4 Apr 0.3333333
# 5 May 0.0000000
# 6 Jun 0.0000000
# 7 Jul 0.5000000
# 8 Aug 1.0000000
# 9 Sep 0.6000000
# 10 Oct 0.0000000
# 11 Nov 0.0000000
I was thinking of somehow using rle for this, but I'm not sure how to force it to stop at two occurrences each time.
Unless I'm missing something, the following looks valid:
#just to remove 'factor's from "temp"
tmp = do.call(cbind.data.frame, c(temp[1], lapply(temp[-1], function(x) as.numeric(as.character(x)))))
sapply(head(seq_len(ncol(tmp))[-1], -1),
function(i) sum(tmp[[i]] & tmp[[i+1]]) / sum(tmp[[i]]))
#[1] 1.0000000 0.2500000 1.0000000 0.3333333 0.0000000 0.0000000 0.5000000 1.0000000 0.6000000 0.0000000 0.0000000
EDIT:
Out of curiosity I checked the speed of @Bathsheba's "bitwise AND", and it seems to be faster than the "logical AND":
#identical results
sapply(head(seq_len(ncol(tmp))[-1], -1),
function(i) sum(bitwAnd(tmp[[i]], tmp[[i+1]])) / sum(tmp[[i]]))
#[1] 1.0000000 0.2500000 1.0000000 0.3333333 0.0000000 0.0000000 0.5000000 1.0000000 0.6000000 0.0000000 0.0000000
#twice as fast
x1 = sample(0:1, 1e6, T); x2 = sample(0:1, 1e6, T)
identical(sum(x1 & x2) / sum(x1), sum(bitwAnd(x1, x2)) / sum(x1))
#[1] TRUE
library(microbenchmark)
microbenchmark(sum(x1 & x2) / sum(x1), sum(bitwAnd(x1, x2)) / sum(x1), times = 50)
#Unit: milliseconds
# expr min lq median uq max neval
# sum(x1 & x2)/sum(x1) 23.95648 25.32448 25.78471 26.56232 49.18491 50
# sum(bitwAnd(x1, x2))/sum(x1) 10.97982 11.07309 11.20237 13.00450 35.67963 50
First fix up temp so that the 0/1 columns are numeric rather than factor. Then apply overlap to each pair of columns:
temp[-1] <- lapply(temp[-1], function(x) as.numeric(as.character(x)))
overlap <- function(x, y) mean(y[x == 1])
data.frame(Month = month.abb[-12],
Overlap = sapply(2:12, function(i) overlap(temp[,i], temp[,i+1])))
The above is preferred as it keeps the independent parts of the solution separate; however, as an alternative we could omit the first line above (which fixes up the factors) and instead incorporate that into overlap like this:
overlap <- function(x, y) mean(as.numeric(as.character(y))[x == 1])
Note that the Overlaps are fractions (as per the output shown in the question) and not percents as the heading in the question suggests.
In pseudocode, represent each column as a binary number.
E.g. Jan = 0b00110 and Feb = 0b11110.
Your formula for Jan is then
Bitcount(Jan AND Feb) / Bitcount(Jan)
Where AND is the bitwise AND operator and Bitcount counts the number of 1 bits in the number. (I can supply a way of bit counting if you need it). Of course, the formula for other months is a trivial generalisation.
Obviously you'll need a branch for the denominator being zero: not well defined in your question.
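A minimal R sketch of that formula, including a branch for a zero denominator, might look like this. The name overlap_ratio and the choice to return NA when the first month has no occurrences are assumptions, since the question does not define that case. Each month is held as a 0/1 vector rather than packed into a single binary number.

```r
# Hypothetical helper following Bitcount(x AND y) / Bitcount(x).
overlap_ratio <- function(x, y) {
  x <- as.integer(x)
  y <- as.integer(y)
  denom <- sum(x)                      # number of 1s in the first month
  if (denom == 0) return(NA_real_)     # assumed convention for "undefined"
  sum(bitwAnd(x, y)) / denom           # 1s present in both months
}

# Columns taken from the example data in the question (rows a to e).
jan <- c(0, 0, 1, 1, 0)
feb <- c(1, 1, 1, 1, 0)
mar <- c(0, 0, 1, 0, 1)

overlap_ratio(jan, feb)           # 1
overlap_ratio(feb, mar)           # 0.25
overlap_ratio(c(0, 0), c(1, 1))   # NA, because the denominator is zero
```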
length(which(!xor(data["Feb"],data["Mar"]) & data["Feb"]==1)) / length(which(data["Feb"]==1))
!xor is the negated exclusive or.
length(which(...)) gives the number of true values in a logical vector.
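As a quick check on the expression above, the sketch below uses plain vectors for Feb and Mar (copied from the question's example) instead of data frame columns. It also compares the result with a simpler form that tests both months directly; that simpler form is an alternative offered here, not part of the original answer.

```r
feb <- c(1, 1, 1, 1, 0)
mar <- c(0, 0, 1, 0, 1)

# The expression from above: "not xor" is TRUE where both months agree,
# and the extra condition keeps only the rows where Feb is 1.
original <- length(which(!xor(feb, mar) & feb == 1)) / length(which(feb == 1))

# Equivalent, simpler form: both months are 1.
simpler <- sum(feb == 1 & mar == 1) / sum(feb == 1)

c(original = original, simpler = simpler)
```

sum() on a logical vector counts the TRUE values directly, so length(which(...)) can be replaced by sum(...) when there are no NAs.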
I have a data frame like the one below, but with a lot more rows
> df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
> df
x1 x2
1 1 a
2 1 a
3 0 b
4 0 a
5 1 c
6 0 c
From df I want a data frame where the rows are the unique values of df$x2 and col1 is the proportion of 1s associated with each letter, and col2 is the count of each letter. So, my output would be
> getprops(df)
prop count
a .6666 3
b 0 1
c 0.5 2
I can think of some elaborate, dirty ways to do this, but I'm looking for something short and efficient. Thanks
I like @RicardoSaporta's solution (+1), but you can use ?prop.table as well:
> df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
> df
x1 x2
1 1 a
2 1 a
3 0 b
4 0 a
5 1 c
6 0 c
> tab <- table(df$x2, df$x1)
> tab
0 1
a 1 2
b 1 0
c 1 1
> ptab <- prop.table(tab, margin=1)
> ptab
0 1
a 0.3333333 0.6666667
b 1.0000000 0.0000000
c 0.5000000 0.5000000
> dframe <- data.frame(values=rownames(tab), prop=ptab[,2], count=rowSums(tab))
> dframe
  values      prop count
a      a 0.6666667     3
b      b 0.0000000     1
c      c 0.5000000     2
If you'd like, you can put this together into a single function:
getprops <- function(values, indicator){
  tab <- table(values, indicator)
  ptab <- prop.table(tab, margin=1)
  # count is the number of rows per value, i.e. the row totals of the table
  dframe <- data.frame(values=rownames(tab), prop=ptab[,2], count=rowSums(tab))
  return(dframe)
}
> getprops(values=df$x2, indicator=df$x1)
  values      prop count
a      a 0.6666667     3
b      b 0.0000000     1
c      c 0.5000000     2
Try installing plyr and running
library(plyr)
df <- data.frame(x1=c(1, 1, 0, 0, 1, 0),
label=c("a", "a", "b", "a", "c", "c"))
ddply(df, .(label), summarize, prop = mean(x1), count = length(x1))
# label prop count
# 1 a 0.6666667 3
# 2 b 0.0000000 1
# 3 c 0.5000000 2
which under the hood applies a split/apply/combine method similar to this in base R:
do.call(rbind, lapply(split(df, df$label),
                      function(d) data.frame(prop = mean(d$x1),
                                             count = nrow(d))))
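To make the three stages explicit, here is a step-by-step sketch of split/apply/combine in base R, using the x1/x2 column names from the question. The intermediate names pieces, stats and result are only illustrative, and the result is a matrix rather than a data frame.

```r
df <- data.frame(x1 = c(1, 1, 0, 0, 1, 0),
                 x2 = c("a", "a", "b", "a", "c", "c"))

# Split: one vector of x1 values per distinct letter in x2.
pieces <- split(df$x1, df$x2)

# Apply: compute the proportion of 1s and the group size for each piece.
stats <- lapply(pieces, function(v) c(prop = mean(v), count = length(v)))

# Combine: stack the per-group results into one matrix, one row per letter.
result <- do.call(rbind, stats)
result
```

Wrapping the last line in as.data.frame() gives a data frame if that is preferred.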
Here is a one-liner in data.table:
> DT[, list(props=sum(x1) / .N, count=.N), by=x2]
x2 props count
1: a 0.6666667 3
2: b 0.0000000 1
3: c 0.5000000 2
where DT <- data.table(df)
I am not sure if this does what you want.
df<-data.frame(x1=c(1,1,0,0,1,0),x2=c("a","a","b","a","c","c"))
ones <- with(df, aggregate(x1 ~ x2, FUN = sum))
count <- table(df$x2)
prop <- ones$x1 / count
df2 <- data.frame(prop, count)
df2
rownames(df2) <- df2[,3]
df2 <- df2[,c(2,4)]
colnames(df2) <- c('prop', 'count')
df2
prop count
a 0.6666667 3
b 0.0000000 1
c 0.5000000 2
Try using table
tbl <- table(df$x1, df$x2)
# a b c
# 0 1 1 1
# 1 2 0 1
tbl["1",] / colSums(tbl)
# a b c
# 0.6666667 0.0000000 0.5000000
For nice output use:
data.frame(proportions=tbl["1",] / colSums(tbl))
proportions
a 0.6666667
b 0.0000000
c 0.5000000
I have a set of counts from data with three dimensions:
df <- data.frame(type = c("A", "B", "B", "A", "A", "C", "B", "C"), group = c("Tp", "Tp", "Tp", "Tp", "Fc", "Fc", "Fc", "Fc"), size = c(10,20,30,40,10,20,30,40), count = c(1, 4, 2, 3, 2, 10, 2, 3))
type group size count
1 A Tp 10 1
2 B Tp 20 4
3 B Tp 30 2
4 A Tp 40 3
5 A Fc 10 2
6 C Fc 20 10
7 B Fc 30 2
8 C Fc 40 3
I would like to find the proportion that each count takes up, but subset over both the type and group dimensions. That is, for example, what is the fraction of size 10's that are in group "Tp" and of type "A"?
I thought there might be a function that was like aggregate or something within the plyr package but would calculate data per row based on subsets but I can't seem to find it. My best effort is using apply:
df$prop <- apply(df, 1, function(x) as.numeric(x["count"])/sum(df[df$type==x["type"] & df$group==x["group"], "count"]))
type group size count prop
1 A Tp 10 1 0.2500000
2 B Tp 20 4 0.6666667
3 B Tp 30 2 0.3333333
4 A Tp 40 3 0.7500000
5 A Fc 10 2 1.0000000
6 C Fc 20 10 0.7692308
7 B Fc 30 2 1.0000000
8 C Fc 40 3 0.2307692
I just wondered if there is an easier way of doing this? If not I will write this up as a custom function.
Thanks.
Try:
transform(df, prop=count/ave(count, type, group, FUN=sum))
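For readers unfamiliar with ave(), the one-liner above can be unpacked as follows: ave() applies FUN within each type/group combination and returns a vector as long as the input, so every row receives its own group's total. The intermediate name group_total is only illustrative.

```r
df <- data.frame(type  = c("A", "B", "B", "A", "A", "C", "B", "C"),
                 group = c("Tp", "Tp", "Tp", "Tp", "Fc", "Fc", "Fc", "Fc"),
                 size  = c(10, 20, 30, 40, 10, 20, 30, 40),
                 count = c(1, 4, 2, 3, 2, 10, 2, 3))

# Sum of 'count' within each type/group combination, repeated on every
# row that belongs to that combination.
group_total <- ave(df$count, df$type, df$group, FUN = sum)

# Each row's share of its own group's total.
df$prop <- df$count / group_total
df
```

Unlike the plyr version, this keeps the rows in their original order.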
With plyr,
ddply(df, c("type","group"), mutate, prop = count/sum(count))
type group size count prop
1 A Fc 10 2 1.0000000
2 A Tp 10 1 0.2500000
3 A Tp 40 3 0.7500000
4 B Fc 30 2 1.0000000
5 B Tp 20 4 0.6666667
6 B Tp 30 2 0.3333333
7 C Fc 20 10 0.7692308
8 C Fc 40 3 0.2307692
The much better scaling and imo more intuitive data.table way:
library(data.table)
dt = data.table(df)
dt[, prop := count/sum(count), by = list(type, group)]