Fuzzy String Matching in R on numbers separated by hyphens


I am trying to match cell phone tower IDs contained in one table with a master table of tower locations (latitude/longitude). The format of the IDs in the locations table is different from the format in the first table, so I am trying to use agrep() to do a fuzzy match. For example, let's say the ID I am trying to match is:
x <- c("405-800-125-39883")
A sample of IDs located in the locations table:
y <- c("405-810-1802-19883", "405-810-2101-29883", "405-810-1401-31883",
"405-810-5005-49883","125-39883","405-810-660-39883")
I am then using agrep() with different combinations of max.distance:
agrep(x,y,max.distance=0.3,value=TRUE)
This returns:
[1] "405-810-1802-19883" "405-810-2101-29883" "405-810-1401-31883" "405-810-5005-49883"
[5] "405-810-660-39883"
Whereas the value that I am really after is "125-39883"
I have also tried the stringdist_join() function (from the fuzzyjoin package), applied to the two data frames with varying max_dist, but with no success. Basically, what I am looking for is a perfect match on the number after the last hyphen, then a match on the number before the last hyphen, and so on. Is there any way of doing that?

You can vectorize agrep so that every value of y is used as the pattern.
Your aim is to find an element of y that appears, as a whole, inside x. The pattern should therefore be y, not x:
names(unlist(Vectorize(agrep)(y,x)))
[1] "125-39883"
Alternatively, we can use adist with the argument partial=TRUE, so that it computes the same approximate substring distance that agrep uses:
y[which.min(c(adist(y,x,partial = T)))]
[1] "125-39883"
If x and y are both vectors, adist is more convenient than agrep because it returns the full matrix of distances between every pair. adist shares most of agrep's matching options (costs, ignore.case, fixed, useBytes). Check ?adist for further details.
Regarding your new question in the comments, you can do something like this:
w=adist(y,x,partial=T)
z=setNames(nchar(sub(".*?(M*)$","\\1",c(attr(adist(y,x,counts=T),"trafos")))),y)
names(which.max(z[which(min(w)==w)]))
[1] "126-39883"
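The question asks for an exact match on the last hyphen-separated number, then the one before it, and so on. A minimal sketch of that idea, as an alternative to edit distance, is to split each ID on hyphens and count how many segments agree when compared from the right. The helper name trailing_matches is hypothetical and not part of base R:

```r
# Count how many hyphen-separated segments agree, comparing from the right.
# `trailing_matches` is a hypothetical helper written for this illustration.
trailing_matches <- function(a, b) {
  sa <- rev(strsplit(a, "-", fixed = TRUE)[[1]])
  sb <- rev(strsplit(b, "-", fixed = TRUE)[[1]])
  n <- min(length(sa), length(sb))
  same <- sa[seq_len(n)] == sb[seq_len(n)]
  # Length of the run of matching segments, starting at the last segment
  if (all(same)) n else which(!same)[1] - 1L
}

x <- "405-800-125-39883"
y <- c("405-810-1802-19883", "405-810-2101-29883", "405-810-1401-31883",
       "405-810-5005-49883", "125-39883", "405-810-660-39883")

# One score per candidate; names(scores) are the candidate IDs
scores <- vapply(y, trailing_matches, integer(1), b = x)
best <- names(scores)[which.max(scores)]
best
```

Here "125-39883" scores 2 (both of its segments match the end of x), while "405-810-660-39883" scores only 1, so the shorter ID wins. Ties are resolved by which.max taking the first maximum, which may or may not be what you want for real data.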

Related

Select data in R that meet a condition and use a for loop on that condition

I have a problem selecting columns in a data frame using a for loop. I'm new to R, so it's very possible that I missed something obvious, but I did not find anything that works for me.
I have a file with 20 climatic variables measured over 60 years in 399 different places.
There is one row for each day, and the columns are the 20 climatic variables for each place (with a number at the end of the name to identify the place where the measurement was taken).
It looks like this:
Temperature_1 Rain_1 .....Temperature_399 Rain_399
Date 1
Date 2
...
I want to select the 20 columns corresponding to one place, run some calculations on the variables, put the results in an empty 3D array I have created, then do the same for the next place until the last one.
My problem is that I don't know how to select the right columns automatically. I also have trouble writing the results into the array.
I tried to select the columns corresponding to one place using the number at the end of the variable names, but I don't see how to change that condition automatically for each place.
I also tried to use the position of the columns, but I'm not doing it properly.
This is my code:
#creation of an empty array
Indice_clim=array(NA,dim = c(60,8,399),dimnames=list(c(1959:2018),c("Huglin","CNI","HD","VHD","SHS","DoF","FreqLF","SLF"),c(1:399)))
#selection of the columns corresponding to the first place using "end with"
maille=select(donnees_SAFRAN,c(1:4),ends_with(".1",ignore.case = FALSE))
# another try using the columns position which I know is really badly done
for (j in seq(from=5, to=7984,by=20)){
paste0("maille",j-4)=select(donnees_SAFRAN,c(1:4),c(j:j+19))
}
#and the calculation on the selected columns, the "i loop" is working.
for(i in 1959:2018){temp=c(maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9)%>%summarise(sum(((T_moy.1-10)+(T_max.1-10))/2)*1.03),
maille%>%filter(an==i,mois==9)%>%summarise(mean(T_min.1)),
maille%>%filter(an==i)%>%summarise(sum(T_max.1>=30)),
maille%>%filter(an==i)%>%summarise(sum(T_max.1>=35)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9,T_moy.1>=28)%>%summarise(sum(T_moy.1-28)),
maille%>%filter(an==i)%>%summarise(sum(T_min.1<=0)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9)%>%summarise(sum(T_min.1<=0)),
maille%>%filter(an==i,mois==4|mois==5|mois==6|mois==7|mois==8|mois==9,T_moy.1<2)%>%summarise(sum(abs(2-T_moy.1))))
Indice_clim[[i-1958,,]]=as.numeric(temp)}
I would like to create a loop (or something similar) that runs my calculation for each place and writes the results into my array.
If you have any ideas, I would very much appreciate it!

You can use the grep() function to look for each of the locations 1, 2, ..., 399 in the column names. If your big dataframe containing all the data is called df, then you could do this:
for (i in 1:399) {
selected_indices <- grep(paste0('_', i, '$'), colnames(df))
# do calculations on the selected columns
df[, selected_indices]
}
The for loop will automatically run through each location i from 1 through 399. The paste0() function concatenates '_' with the variable i and the dollar sign $ to create strings like "_1$", "_2$", ..., "_399$", which are then searched for using the grep() function in the column names of df. The '$' is used to specify that you want the patterns _1, _2, ... to appear at the end of the column names (it is a regular expression special character).
The grep() function uses these regular expressions to return the column indices required for each location. You can then extract the relevant portion of df and do whatever calculations you want.
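To show how the column selection and the 3D array fit together, here is a small sketch on made-up data. Everything in it is an assumption for illustration: the data frame df, the year column an, the variable names T_max_ and T_min_, the three places, and the two indices computed. Stripping the place suffix after selecting the columns lets the same calculation code run for every place.

```r
# Toy data: 2 years with 5 rows each, 2 variables for each of 3 places.
# All names here (df, an, T_max_1, ...) are made up for illustration.
set.seed(1)
df <- data.frame(an = rep(c(1959, 1960), each = 5))
for (p in 1:3) {
  df[[paste0("T_max_", p)]] <- runif(10, 20, 40)
  df[[paste0("T_min_", p)]] <- runif(10, -5, 15)
}

years <- sort(unique(df$an))
res <- array(NA_real_, dim = c(length(years), 2, 3),
             dimnames = list(years, c("hot_days", "frost_days"), 1:3))

for (p in 1:3) {
  # Columns whose names end in _p, e.g. "_1" but not "_11"
  cols <- grep(paste0("_", p, "$"), colnames(df))
  place <- df[, cols]
  # Strip the place suffix so the same code works for every place
  colnames(place) <- sub(paste0("_", p, "$"), "", colnames(place))
  for (k in seq_along(years)) {
    rows <- df$an == years[k]
    # One year (row), all indices (columns), one place (slice)
    res[k, , p] <- c(sum(place$T_max[rows] >= 30),
                     sum(place$T_min[rows] <= 0))
  }
}
```

Note the single-bracket indexing res[k, , p] when writing into the array: it assigns a whole vector of indices for one year and one place.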

Conditionally add character to a string

I am trying to add 0s into character strings, but only under certain conditions.
I have a vector of file names like such:
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
"res_P1_R19.rds", "res_P2_R2.rds",
"res_P10_R1.rds", "res_P10_R19.rds")
I want to sort(my.fl) so that the file names are ordered by the numbers following the P and R, but as it stands sorting results in this:
"res_P1_R1.rds" "res_P1_R19.rds" "res_P10_R1.rds" "res_P10_R19.rds" "res_P2_R1.rds" "res_P2_R2.rds"
To fix this I need to add a 0 after P and R, but only when the number that follows is between 1 and 9. If the number is greater than 9, I want to leave it unchanged.
The result should be as follows:
"res_P01_R01.rds" "res_P01_R19.rds" "res_P10_R01.rds" "res_P10_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds"
and if I sort it, it is ordered as expected e.g.:
"res_P01_R01.rds" "res_P01_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds" "res_P10_R01.rds" "res_P10_R19.rds"
I can add 0s based on position, but since the required position changes my solution only works on a subset of the file names. I think this would be a common problem but I haven't managed to find an answer on SO (or anywhere), any help much appreciated.

You should be able to just use mixedsort from the gtools package which removes the need to insert zeroes.
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
"res_P1_R19.rds", "res_P2_R2.rds",
"res_P10_R1.rds", "res_P10_R19.rds")
library(gtools)
mixedsort(my.fl)
[1] "res_P1_R1.rds" "res_P1_R19.rds" "res_P2_R1.rds" "res_P2_R2.rds" "res_P10_R1.rds" "res_P10_R19.rds"
But if you do want to insert the zeroes you could use something like:
sort(gsub("(?<=\\D)(\\d{1})(?=\\D)", "0\\1", my.fl, perl = TRUE))
[1] "res_P01_R01.rds" "res_P01_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds" "res_P10_R01.rds" "res_P10_R19.rds"
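If you would rather avoid both the extra package and renaming, another option is to extract the two numbers and order by them directly. This is a sketch in base R that assumes every file name follows the res_P<number>_R<number>.rds layout shown in the question:

```r
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
           "res_P1_R19.rds", "res_P2_R2.rds",
           "res_P10_R1.rds", "res_P10_R19.rds")

# Pull out the numbers after P and after R as integers
p <- as.integer(sub(".*_P(\\d+)_R\\d+\\.rds$", "\\1", my.fl))
r <- as.integer(sub(".*_P\\d+_R(\\d+)\\.rds$", "\\1", my.fl))

# Order by P first, then by R, leaving the file names untouched
sorted <- my.fl[order(p, r)]
sorted
```

Because the comparison is numeric, this keeps working when the numbers grow beyond two digits, where a single leading zero would no longer be enough.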

Creating Sub Lists from A to Z from a Master List

Task
I am attempting to use better functionality (a loop or vectorized code) to break a larger list down into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list those that start with the letter B ... and the possible 27th list contains all remaining entries that start with numbers or other characters).
I am then attempting to identify which names on the list are similar by using the adist function (for instance, I need to correct company names that are misspelled, e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and work my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.

It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for(i in 1:26) {
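# keep the rows whose uniqueID starts with the i-th letter (either case)
stake <- subset(df, grepl(paste0('^[',letters[i],LETTERS[i],']'), uniqueID))
# distance matrix for the IDs selected in this iteration
stakeDist <- adist(stake$uniqueID, ignore.case=TRUE)
write.table(stakeDist, paste0("stake_",LETTERS[i],".csv"), quote=TRUE, sep=',')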
}
paste0, combined with the built-in letters and LETTERS vectors, creates the regular expression for each letter.
subset then extracts the rows with matching IDs.
paste0 also creates a unique filename for write.table().
A for() loop ties it all together.
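For the second ask (finding likely misspellings without going through Excel), one possible approach is to split the names by first letter and then list only the pairs whose edit distance is small. The sketch below uses made-up names, and the cutoff of 2 edits is an arbitrary assumption that would need tuning on real data:

```r
# Hypothetical names; a few contain deliberate misspellings
uniqueID <- c("Company A", "Companyy A", "Acme Ltd", "Beta Corp",
              "Beta Corpp", "3M", "Zeta Inc")

# Group key: upper-cased first letter, or "other" for digits and symbols
first <- toupper(substr(uniqueID, 1, 1))
key <- ifelse(first %in% LETTERS, first, "other")
groups <- split(uniqueID, key)

# Within each group, list pairs whose edit distance is small but non-zero
close_pairs <- lapply(groups, function(ids) {
  d <- adist(ids, ignore.case = TRUE)
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(d > 0 & d <= 2 & upper.tri(d), arr.ind = TRUE)
  data.frame(a = ids[idx[, "row"]], b = ids[idx[, "col"]],
             dist = d[idx], stringsAsFactors = FALSE)
})
candidates <- do.call(rbind, close_pairs)
candidates
```

The result is a short table of candidate pairs to review, instead of a full distance matrix per letter. Deciding which member of each pair is the correct spelling still needs a rule or a manual check.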

Data Frame containing hyphens using R

I have created a list (based on items in a column) in order to subset my dataset into smaller datasets, each relating to a particular variable. The strings in this list contain hyphens (-).
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {assign(paste("df1.",i,sep=""),df[df$Dimension==i,])}
This works fine, however when I come to aggregate this in order to get some summary statistics I can't reference the dataset as R stops reading after the hyphen (I assume that the hyphen is some special character)
If I use a different list without hyphens e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes I get 6 data.frames with no observations.
Why is this happening?

Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces for binary operations. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ # # ! & | ^ ( [ ' " in them. At ?quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice. I completely agree with Axeman's comments. Use split to divide up your data frame into a list. And keep it in a list rather than use assign, it will be much easier to loop over or use lapply with that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
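As a small illustration of why the list approach avoids the naming problem, here is a sketch on toy data. The Value column and the "Something-Else" row are made up; only the Dimension column and dim.list come from the question:

```r
# Toy data; the Value column is an assumption for illustration
df <- data.frame(
  Dimension = c("Age_CareContactDate-Gender", "Age_CareContactDate-Gender",
                "Age_CareContactDate-Group", "Something-Else"),
  Value = c(10, 20, 5, 99),
  stringsAsFactors = FALSE
)
dim.list <- c("Age_CareContactDate-Gender", "Age_CareContactDate-Group")

# Keep only the dimensions of interest, then split into a named list
kept <- df[df$Dimension %in% dim.list, ]
by_dim <- split(kept, kept$Dimension)

# Hyphenated names are fine as list names when accessed with [[ ]]
gender <- by_dim[["Age_CareContactDate-Gender"]]

# One summary statistic per data frame, without assign() or backticks
totals <- sapply(by_dim, function(d) sum(d$Value))
totals
```

List element names are ordinary strings, so the hyphen is never parsed as a minus sign.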

Creating abstract patterns from string data in R

I am trying to find patterns in strings in R by assigning tokens to the locations. I have a vector of the following form. Each element records the changes of location made by one person in one year. For example, in the first case, one person moves from London to New York, back to London, then to Beijing and on to Cleveland.
path <- c('Lon-NYC-Lon-Bei-Cle', 'Mos-NYC-Bei-Cle-San', 'Bei-Cle-Bei-NYC-San')
I am trying to look for generic abstract patterns. I want to create a variable called: 'pattern' which gives me A-B-A-C-D for Lon-NYC-Lon-Bei-Cle string, A-B-C-D-E for Mos-NYC-Bei-Cle-San, A-B-A-C-D for Bei-Cle-Bei-NYC-San.
pattern <- c('A-B-A-C-D', 'A-B-C-D-E', 'A-B-A-C-D')
Is there a way I can create this variable?

If you never have more than 26 unique values in a path, you can use something like this:
sapply(strsplit(path,"-"), function(x)
paste(LETTERS[factor(x, levels=unique(x))], collapse="-")
)
# [1] "A-B-A-C-D" "A-B-C-D-E" "A-B-A-C-D"
Here we use strsplit() to find the different pieces and factor() to take care of identifying duplicate values. Setting levels=unique(x) numbers the places in order of first appearance. Then we use the integer codes underlying the factor to index into the set of upper-case letters.
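To make the intermediate steps visible, here is the same idea worked through for a single path. The match() line is an equivalent way to get the codes without building a factor:

```r
x <- strsplit("Lon-NYC-Lon-Bei-Cle", "-")[[1]]
# x is c("Lon", "NYC", "Lon", "Bei", "Cle")

# Levels in order of first appearance, so the first place is always "A"
f <- factor(x, levels = unique(x))
codes <- as.integer(f)          # 1 2 1 3 4

# match() gives the same codes without building a factor
stopifnot(identical(codes, match(x, unique(x))))

pattern <- paste(LETTERS[codes], collapse = "-")
pattern                         # "A-B-A-C-D"
```

With more than 26 distinct places in one path, LETTERS[codes] would return NA for the extra places, which is why the limit matters.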