Pattern matching character vectors in R - r

I am trying to match the characters between two vectors in two separate dataframes; let's call the dataframes "rentals" and "parcels". Both contain a character vector "address": in "rentals" it holds the addresses of all rental parcels in a county, and in "parcels" it holds the addresses of all parcels in a city. I would like to figure out which addresses in the "parcels" dataframe match an address in the "rentals" dataframe by searching through the vector of addresses in "parcels" for matches with an address in "rentals".
The values in rentals$address look like this:
rentals$address <- c("110 SW ARTHUR ST", "1610 NE 66TH AVE", "1420 SE 16TH AVE",...)
And the values in parcels$address look like this:
parcels$address <- c("635 N MARINE DR, PORTLAND, OR, 97217", "7023 N BANK ST, PORTLAND, OR, 97203", "5410 N CECELIA ST, PORTLAND, OR, 97203",...)
There are about 172,000 entries in the "parcels" dataframe and 285 in the "rentals" dataframe. My first solution was to match character values using grepl, which I don't think worked:
matches = grepl(rentals$address, parcels$address, fixed = TRUE)
This returns FALSE for each entry in parcels$address, but when I copy and paste some values of "address" from "rentals" into Excel's Ctrl+F window while viewing the "parcels" dataframe, I see a few addresses. So some appear to match.
How can I best find which values in the "address" column of the "rentals" dataframe appear as a matching character sequence in the "parcels" dataframe?

Are the addresses all exact matches? That is, no variations in spacing, capitalization, or apartment number? If so, you might be able to use the dplyr function left_join to create a new df, using the address as the key, like so:
library(dplyr)
df_compare <- df_rentals %>%
  left_join(df_parcels, by = "address")
Additionally, if you have columns along the lines of df_rentals$rentals = yes and df_parcels$parcels = yes, you can filter the resulting new dataframe:
df_both <- filter(df_compare, rentals == "yes", parcels == "yes")
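Two details of the question are worth spelling out. grepl() takes a single pattern, so passing the whole rentals$address vector only tests its first element, and the addresses in "parcels" carry a ", PORTLAND, OR, <zip>" suffix that an exact join on "address" would not match. Below is a minimal sketch of one way around both, assuming the street part of a parcel address is always everything before the first comma. The sample rows are partly made up for illustration, and the column names street and is_rental are hypothetical.

```r
# Small stand-ins for the real data frames; the third rental and the third
# parcel row are invented so that the example contains two matches.
rentals <- data.frame(
  address = c("110 SW ARTHUR ST", "1610 NE 66TH AVE", "635 N MARINE DR"),
  stringsAsFactors = FALSE
)
parcels <- data.frame(
  address = c("635 N MARINE DR, PORTLAND, OR, 97217",
              "7023 N BANK ST, PORTLAND, OR, 97203",
              "110 SW ARTHUR ST, PORTLAND, OR, 97201"),
  stringsAsFactors = FALSE
)

# Assumption: the street address is everything before the first comma.
parcels$street <- sub(",.*$", "", parcels$address)

# Exact, vectorised lookup: TRUE where the parcel's street is a rental address.
parcels$is_rental <- parcels$street %in% rentals$address

# Row positions in parcels that matched
which(parcels$is_rental)
```

With the street part isolated in its own column, a join on that column (rather than on the full address) becomes possible as well. Differences in spacing or abbreviations would still need cleaning first.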

Related

Using R, How to use a character vector to search for matches in a very large character vector

I have a vector of city names:
Cities <- c("New York", "San Francisco", "Austin")
I want to use it to find records in a 1,000,000+ element column of city/state names, contained in a bigger table, that match any of the items in the Cities vector:
Locations <- c("San Antonio/TX", "Austin/TX", "Boston/MA")
I tried using lapply and grep, but it kept saying it can't use an input vector dimension larger than 1.
Ideally I want to return the row positions in the Locations vector that contain any item in the Cities vector, which will allow me to select matching rows in the broader table.
grep and family only allow a single pattern= in their call, but one can use Vectorize to help with this:
out <- Vectorize(grepl, vectorize.args = "pattern")(Cities, Locations)
rownames(out) <- Locations
out
#                New York San Francisco Austin
# San Antonio/TX    FALSE         FALSE  FALSE
# Austin/TX         FALSE         FALSE   TRUE
# Boston/MA         FALSE         FALSE  FALSE
(I added rownames(.) purely to identify columns/rows from the source data.)
With this, if you want to know which index points where, then you can do
apply(out, 1, function(z) which(z)[1])
# San Antonio/TX      Austin/TX      Boston/MA
#             NA              3             NA
apply(out, 2, function(z) which(z)[1])
#      New York San Francisco        Austin
#            NA            NA             2
The first gives, for each location, the index within Cities of the city that matched it. The second gives, for each city, the index within Locations of the location that matched it. Both of these methods assume that there is at most a 1-to-1 matching; if there are ever more matches, which(z)[1] will hide the 2nd and subsequent ones, which is likely not a good thing.
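If more than one match per row or column is possible, a sketch along the following lines keeps all of them and also returns the row positions the question asks for. It uses sapply() in place of Vectorize() and literal matching (fixed = TRUE), which is an assumption that the city names should not be treated as regular expressions; the names hits and rows are just illustrative.

```r
Cities <- c("New York", "San Francisco", "Austin")
Locations <- c("San Antonio/TX", "Austin/TX", "Boston/MA")

# One logical column per city, one row per location (literal matching).
out <- sapply(Cities, grepl, x = Locations, fixed = TRUE)

# Every (row, col) pair that matched, so multiple matches are not hidden.
hits <- which(out, arr.ind = TRUE)

# Row positions in Locations that contain any of the cities.
rows <- which(rowSums(out) > 0)
rows
```

The rows vector can then be used directly to subset the larger table, for example big_table[rows, ] for a hypothetical data frame big_table whose city/state column is Locations.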

Geocoding in R using googleway

I have read Batch Geocoding with googleway R
I am attempting to geocode some addresses using googleway. I want the geocodes, address, and county returned back.
Using the answer linked to above I created the following function.
geocodes<-lapply(seq_along(res),function(x) {
  coordinates<-res[[x]]$results$geometry$location
  df<-as.data.frame(unlist(res[[x]]$results$address_components))
  address<-paste(df[1,],df[2,],sep = " ")
  city<-paste0(df[3,])
  county<-paste0(df[4,])
  state<-paste0(df[5,])
  zip<-paste0(df[7,])
  coordinates<-cbind(coordinates,address,city,county,state,zip)
  coordinates<-as.data.frame(coordinates)
})
Then put it back together like so...
library(data.table)
done<-rbindlist(geocodes)
The issue is getting the address and county back out from the 'res' list. The answer linked to above pulls the address from the dataframe that was sent to google and assumes the list is in the right order and there are no multiple match results back from google (in my list there seems to be a couple). Point is, taking the addresses from one file and the coordinates from another seems rather reckless and since I need the county anyway, I need a way to pull it out of google's resulting list saved in 'res'.
The issue is that some addresses have more "types" than others which means referencing by row as I did above does not work.
I also tried including rbindlist inside the function to convert the sublist into a datatable and then pull out the fields but can't quite get it to work. The issue with this approach is that actual addresses are in a vector but the 'types' field which I would use to filter or select is in a sublist.
The best way I can describe it is like this -
list <- c(long address),c(short address), types(LIST(street number, route, county, etc.))
Obviously, I'm a beginner at this. I know there's a simpler way but I am just really struggling with lists and R seems to make extensive use of them.
Edit:
I definitely recognize that I cannot rbind the whole list. I need to pull specific elements out and bind just those. A big part of the problem, in my mind, is that I do not have a great handle on indexing and manipulating lists.
Here are some addresses to try: "301 Adams St, Friendship, WI 53934, USA" has a 7x3 "address components" matrix and a corresponding "types" list of 7. Compare that to "222 S Walnut St, Appleton, WI 45911, USA", which has a 9x3 address components matrix and a "types" list of 9. The types list needs to be connected back to the address components matrix because the types list identifies what each row of the address components matrix contains.
Then there are more complexities introduced by imperfect matches. Try "211 Grand Avenue, Rothschild, WI, 54474" and you get 2 lists, one for east grand ave and one for west grand ave. Google seems to prefer the east since that's what comes out in the "formatted address." I don't really care which is used since the county will be the same for either. The "location" interestingly contains 2 sets of geocodes which, presumably, refer to the two matches. I think this complexity can be ignored since the location consisting of two coordinates is still stored as a 'double' (not a list!) so it should stack with the coordinates for the other addresses.
Edit: This should really work but I'm getting an error in the do.call(rbind,types) line of the function.
geocodes<-lapply(seq_along(res),function(x) {
  coordinates<-res[[x]]$results$geometry$location
  types<-res[[x]]$results$address_components[[1]]$types
  types<-do.call(rbind,types)
  types<-types[,1]
  address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,stringsAsFactors=FALSE)
  names(address)[1]<-"V2"
  address<-cbind(address,types)
  address<-tidyr::spread(address,types,V2)
  address<-cbind(address,coordinates)
})
R says the "types" object is not a list, so it can't rbind it. I tried coercing it to a list but still get the error. I checked using the following pared-down function and found that #294 is NULL, which halts the function. For that entry I get "over query limit" as an error, but I am not over the query limit.
geocodes<-lapply(seq_along(res),function(x) {
  types<-res[[x]]$results$address_components[[1]]$types
  print(typeof(types))
})
Here's my solution using tidyverse functions. This gets the geocode and also the formatted address in case you want it. Other components of the result can be returned as well; they just need to be added to the table that is built and returned at the end of the function passed to map.
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(googleway))
set_key("your key here")
df <- tibble(full_address = c("2379 ADDISON BLVD HIGH POINT 27262",
                              "1751 W LEXINGTON AVE HIGH POINT 27262",
                              "dljknbkjs"))
df %>%
  mutate(geocode_result = map(full_address, function(full_address) {
    res <- google_geocode(full_address)
    if (res$status == "OK") {
      geo <- geocode_coordinates(res) %>% as_tibble()
      formatted_address <- geocode_address(res)
      geocode <- bind_cols(geo, formatted_address = formatted_address)
    } else {
      # No usable result: return a one-row placeholder so the address is kept
      geocode <- tibble(lat = NA, lng = NA, formatted_address = NA)
    }
    geocode
  })) %>%
  # newer tidyr versions expect the column to unnest to be named explicitly
  unnest(geocode_result)
#> # A tibble: 3 x 4
#>   full_address                  lat   lng formatted_address
#>   <chr>                       <dbl> <dbl> <chr>
#> 1 2379 ADDISON BLVD HIGH POI…  36.0 -80.0 2379 Addison Blvd, High Point, N…
#> 2 1751 W LEXINGTON AVE HIGH …  36.0 -80.1 1751 W Lexington Ave, High Point…
#> 3 dljknbkjs                    NA    NA   <NA>
Created on 2019-04-14 by the reprex package (v0.2.1)
Ok, I'll answer it myself.
Begin with a dataframe of addresses. I called mine "addresses", and its single column is called "Address" (note the capital A), which is the name the code below looks up.
Use googleway to get the geocode data. I did this using apply to loop across the rows of the address dataframe:
library(googleway)
res <- apply(addresses, 1, function(x) {
  google_geocode(address = x[["Address"]], key = "insert your google api key here - its free to get")
})
Here is the function I wrote to get the nested lists into a dataframe.
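geocodes <- lapply(seq_along(res), function(x) {
  # lat/lng for each result returned for this address
  coordinates <- res[[x]]$results$geometry$location
  # keep the first entry of each component's type vector, e.g. "route", "locality"
  types <- res[[x]]$results$address_components[[1]]$types
  types <- do.call(rbind, types)
  types <- types[, 1]
  address <- as.data.frame(res[[x]]$results$address_components[[1]]$long_name, stringsAsFactors = FALSE)
  names(address)[1] <- "V2"
  address <- cbind(address, types)
  # reshape to a single wide row: one column per component type
  address <- tidyr::spread(address, types, V2)
  address <- cbind(address, coordinates)
})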
library(data.table)
geocodes<-rbindlist(geocodes,fill=TRUE)
lapply loops along the items in the list. Within the function, I create a coordinates dataframe and put the geocodes there. I also wanted the other address components, particularly the county, so I created the "types" vector, which identifies what each item in the address is. I cbind the address items with the types, then use spread from the tidyr package to reshape the dataframe into wide format so that it is a single row. I then cbind in the lat and lng from the coordinates dataframe.
The rbindlist stacks it all back together. You could use do.call(rbind, geocodes) but rbindlist is faster.
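Because the number of address components varies between results, looking components up by type rather than by row position avoids the misalignment described in the question. The sketch below works on a hand-built stand-in for res[[x]]$results$address_components[[1]] (a data frame with long_name, short_name and a list column types), so it makes no API call; the values are illustrative rather than a real response, and get_component is a hypothetical helper, not a googleway function.

```r
# Hand-built stand-in for one result's address components; illustrative only.
components <- data.frame(
  long_name  = c("301", "Adams Street", "Friendship", "Adams County", "Wisconsin"),
  short_name = c("301", "Adams St", "Friendship", "Adams County", "WI"),
  stringsAsFactors = FALSE
)
components$types <- list(
  "street_number",
  "route",
  c("locality", "political"),
  c("administrative_area_level_2", "political"),
  c("administrative_area_level_1", "political")
)

# Hypothetical helper: long_name of the first component tagged with `type`,
# or NA when that component (or the whole result) is missing.
get_component <- function(components, type) {
  if (is.null(components)) return(NA_character_)
  has_type <- vapply(components$types, function(t) type %in% t, logical(1))
  if (any(has_type)) components$long_name[which(has_type)[1]] else NA_character_
}

county <- get_component(components, "administrative_area_level_2")
zip    <- get_component(components, "postal_code")  # absent here, so NA
```

Returning NA for a missing component or a NULL result means an entry such as the failed #294 mentioned in the question would yield a row of NAs instead of stopping the loop.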

Fuzzy matching by category

I am trying to fuzzy match two different dataframes based on company names, using the agrep function. To improve my matching, I would like to only match companies if they are located in the same country.
df1:                          df2:
Company              ISO      Company               ISO
Aalberts Industries  NL       Aalberts              NL
Allison              NL       Allison transmission  NL
Allison              UK       Allison transmission  UK
I use the following function to match:
testb$test2 <- ""
for (i in 1:dim(testb)[1]) {
  x2 <- agrep(testb$name[i], testa$name, ignore.case = TRUE, value = TRUE, max.distance = Inf, useBytes = TRUE, fixed = TRUE)
  x2 <- paste0(x2, "")
  testb$test2[i] <- x2
}
I can create a subset for every country and then match each subset, which works but is time-consuming. Is there another way to let R only match company names if df1$ISO == df2$ISO? Thanks!
Try indexing with the data.table package: https://www.r-bloggers.com/intro-to-the-data-table-package/.
Your company columns seem to be too dissimilar to match consistently and accurately with agrep(). For example, "Aalberts Industries" will match "Aalberts" only when you set max.distance to a value greater than 10. The same string distance would also report a match between "Algebra" and "Alleyway" — not very close at all. I recommend cleaning out the unnecessary words in your company columns before matching.
Sorry, I would make this a comment, but I don't have the required reputation. Maybe someone could convert this to a comment for me?
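For the country restriction itself, one option that needs no extra package is to narrow the candidate list to the same ISO code before each agrep() call. This is a minimal sketch using the example rows from the question; the match column and the max.distance value of 0.1 are assumptions chosen for illustration, not tuned settings.

```r
df1 <- data.frame(
  Company = c("Aalberts Industries", "Allison", "Allison"),
  ISO     = c("NL", "NL", "UK"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Company = c("Aalberts", "Allison transmission", "Allison transmission"),
  ISO     = c("NL", "NL", "UK"),
  stringsAsFactors = FALSE
)

# For each row of df1, fuzzy-match only against df2 companies in the same country.
df1$match <- vapply(seq_len(nrow(df1)), function(i) {
  candidates <- df2$Company[df2$ISO == df1$ISO[i]]
  hit <- agrep(df1$Company[i], candidates, ignore.case = TRUE,
               value = TRUE, max.distance = 0.1)
  # Keep the first candidate if there is one, otherwise NA.
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))

df1
```

Here "Allison" finds "Allison transmission" in both countries, while "Aalberts Industries" finds nothing at this distance because the pattern is longer than the target, which is the cleaning issue described above.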

Can't match two dataframe values

I am not sure why the dataframe values do not match with each other.
I have a df named fileUpload which looks like this (the cols are aligned correctly):
Destination City  Year  Adults
Amsterdam         2015  2
Amsterdam         2016  2
Amsterdam         2015  2
Amsterdam         2016  2
Amsterdam         2015  3
There is a space after each city name.
I have another dataframe that is not uploaded, like this:
cities <- read.csv(text = "
City,Lat,Long,Pop
Amsterdam ,4.8952,52.3702,779808
Bali ,115.1889,-8.4095,4225000")
I need to merge the two dataframes, but I realized that the city values come back as not matching (NA). I tried checking with fileUpload %in% cities, which returns FALSE.
I tried removing the space after the city, also did not work.
The typeof(df$city) for both is integer.
How can I make the cities name match together?
As pointed out in the comments, you should convert your columns from factors to strings.
mergedCities <- merge(fileUpload, cities, by.x = "Destination City", by.y = "City", all = TRUE)
Set the all parameter (or all.x / all.y) to specify whether you want to keep all cities, just the ones from x or y, or only the cities present in both.
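A minimal sketch of the conversion step, assuming both city columns were read as factors and may carry trailing spaces. The fileUpload rows are rebuilt by hand from the question, and check.names = FALSE is used only so the column can keep the space in its name.

```r
# Stand-in for the uploaded data, with the city stored as a factor.
fileUpload <- data.frame(
  `Destination City` = c("Amsterdam ", "Amsterdam "),
  Year   = c(2015, 2016),
  Adults = c(2, 2),
  check.names = FALSE,
  stringsAsFactors = TRUE
)
cities <- read.csv(text = "City,Lat,Long,Pop
Amsterdam ,4.8952,52.3702,779808
Bali ,115.1889,-8.4095,4225000", stringsAsFactors = TRUE)

# Convert factors to character and drop leading/trailing whitespace.
fileUpload$`Destination City` <- trimws(as.character(fileUpload$`Destination City`))
cities$City <- trimws(as.character(cities$City))

# Keep every uploaded row and attach the matching city data.
merged <- merge(fileUpload, cities,
                by.x = "Destination City", by.y = "City", all.x = TRUE)
merged
```

Trimming both sides makes the merge independent of whether the trailing space is present in one file, both, or neither.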

Selecting strings and using in logical expressions to create new variable - R

I have a categorical variable indicating location of flu clinics as well as an "other" category. Participants who select the "other" category give open-ended responses for their location. In most cases, these open-ended responses fit with one of the existing categories (for example, one category is "public health clinic", but some respondents picked "other" and cited "mall" which was a public health clinic). I could easily do this by hand but want to learn the code to select "mall" strings then use logical expressions to assign these people to "public health clinic" (e.g. create a new variable for location of flu clinics).
My categorical variable is "lrecflu2" and my character string variable is "lfother"
So far I have:
mall <- grep("MALL", Motiv82012$lfother, value = TRUE)
This gives me a vector with all the string responses containing "MALL" (all strings are in caps in the dataframe)
How do I use this vector in a logical expression to create a new variable that assigns these people to the "public health clinic" category and assigns the original value of flu clinic location variable for people that did not select "other" (and do not have values in the character string variable) to the new flu clinic location variable?
Perhaps, grep is not even the right function to be using.
As I understand it, you have a column in a data frame, where you want to reassign one character value to another. If so, you were almost there...
set.seed(1) # for generating an example
df1 <- data.frame(flu2=sample(c("MALL","other","PHC"),size=10,replace=TRUE))
df1$flu2[grep("MALL",df1$flu2)] <- "PHC"
Here grep() is giving you the required vector index; you then subset the vector based on this and change those elements.
Update 2
This should produce a data.frame similar to the one you are using:
set.seed(1)
lreflu2 <- sample(c("PHC","Med","Work","other"),size=10,replace=TRUE)
Ifother <- rep("",10) # blank character vector
s1 <- c("Frontenac Mall","Kingston Mall","notMALL")
Ifother[lreflu2=="other"] <- s1
df1 <- data.frame(lreflu2,Ifother)
### alternative:
### df1 <- data.frame(lreflu2,Ifother, stringsAsFactors = FALSE)
df1
gives:
   lreflu2        Ifother
1      Med
2      Med
3     Work
4    other Frontenac Mall
5      PHC
6    other  Kingston Mall
7    other        notMALL
8     Work
9     Work
10     PHC
If you're looking for an exact string match you don't need grep at all:
df1$lreflu2[df1$Ifother=="MALL"] <- "PHC"
Using a regex:
df1$lreflu2[grep("Mall",df1$Ifother)] <- "PHC"
gives:
   lreflu2        Ifother
1      Med
2      Med
3     Work
4      PHC Frontenac Mall
5      PHC
6      PHC  Kingston Mall
7    other        notMALL
8     Work
9     Work
10     PHC
Whether Ifother is a factor or a vector with mode character doesn't affect things. Before R 4.0.0, data.frame coerced string vectors to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE.
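Since the question asks for a new variable rather than overwriting the original, the same idea can be written with grepl() and ifelse(). This is a sketch under the assumption that both columns are character vectors and that the free-text responses are in capitals, as described in the question; the sample rows and the name location_new are made up for illustration.

```r
df1 <- data.frame(
  lrecflu2 = c("PHC", "other", "Work", "other", "other"),
  lfother  = c("", "FRONTENAC MALL", "", "KINGSTON MALL", "CHURCH"),
  stringsAsFactors = FALSE
)

# TRUE only for people who chose "other" and mentioned a mall.
is_mall <- df1$lrecflu2 == "other" & grepl("MALL", df1$lfother, fixed = TRUE)

# New variable: recode mall responses, otherwise keep the original category.
df1$location_new <- ifelse(is_mall, "PHC", df1$lrecflu2)
df1
```

grepl() returns a logical vector the same length as the column, which makes it easier to combine with other conditions than the index vector returned by grep().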
