`data.table` way to select subsets based on `agrep`? - r

I'm trying to convert from data.frame to data.table, and need some advice on some logical indexing I am trying to do on a single column. Here is a table I have:
places <- data.table(name=c('Brisbane', 'Sydney', 'Auckland',
'New Zealand', 'Australia'),
search=c('Brisbane AU Australia',
'Sydney AU Australia',
'Auckland NZ New Zealand',
'NZ New Zealand',
'AU Australia'))
# name search
# 1: Brisbane Brisbane AU Australia
# 2: Sydney Sydney AU Australia
# 3: Auckland Auckland NZ New Zealand
# 4: New Zealand NZ New Zealand
# 5: Australia AU Australia
setkey(places, search)
I want to extract rows whose search column matches all words in a list, like so:
words <- c('AU', 'Brisbane')
hits <- places
for (w in words) {
hits <- hits[search %like% w]
}
# I end up with the 'Brisbane AU Australia' row.
I have one question:
Is there a more data.table-way to do this? It seems to me that storing hits each time seems like a data.frame way to do this.
This is subject to the caveat that I eventually want to use agrep rather than grep/%like%:
words <- c('AU', 'Bisbane') # note the mis-spelling
hits <- places
for (w in words) {
hits <- hits[agrep(w, search)]
}
I feel like this doesn't quite take advantage of data.table's capabilities and would appreciate thoughts on how to modify the code so it does.
EDIT
I want the for loop because places is quite large, and I only want to find rows that match all the words. Hence I only need to search in the results for the last word for the next word (that is, successively refine the results).
With the talk of "binary scan" vs "vector scan" in the data.table introduction (i.e. "bad way" is DT[DT$x == "R" & DT$y == "h"], "good way" is setkey(DT, x, y); DT[J("R", "h")]), I just wondered if there was some way I could apply this approach here.

Mathematical.coffee, as I mentioned in the comments, you cannot "partial match" by setting a column (or several columns) as key column(s). That is, in the data.table places, you've set the column "search" as the key column. Here, you can subset quickly using data.table's binary search (as opposed to vector scan subsetting) by doing:
places["Brisbane AU Australia"] # binary search when "search" column is key'd
# is faster compared to:
places[search == "Brisbane AU Australia"] # vector scan
But in your case, you require:
places["AU"]
to give all rows that have a partial match of "AU" within the key column. This is not possible (although it would certainly be a very interesting feature to have).
If the substring you're searching for does not itself contain mismatches, then you can try splitting the search strings into separate columns. That is, if the column search is split into three columns containing Brisbane, AU and Australia, then you can set the key of the data.table to the columns that contain AU and Brisbane. Then you can query in the way you mention:
# fast subset, AU and Brisbane are entries of the two key columns
places[J("AU", "Brisbane")]
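As a rough sketch of that idea, assuming the search strings have already been split into separate columns (the table places2 and the column names code and city below are made up for illustration and are not in the original table), a two-column key allows an exact binary-search lookup:

```r
library(data.table)

# Hypothetical table: `code` and `city` stand in for pieces split out of `search`
places2 <- data.table(
  name = c("Brisbane", "Sydney", "Auckland"),
  code = c("AU", "AU", "NZ"),
  city = c("Brisbane", "Sydney", "Auckland")
)
setkey(places2, code, city)

# exact match on both key columns; nomatch = 0L drops lookups that find nothing
hit <- places2[J("AU", "Brisbane"), nomatch = 0L]
print(hit)
```

This only helps for exact values. A misspelt "Bisbane" would still find nothing, so it does not replace agrep.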

You can vectorize the agrep function to avoid looping.
Note that the result of agrep2 is a list, hence the unlist call:
words <- c("Bisbane", "NZ")
agrep2 <- Vectorize(agrep, vectorize.args = "pattern")
places[unlist(agrep2(words, search))]
## name search
## 1: Brisbane Brisbane AU Australia
## 2: Auckland Auckland NZ New Zealand
## 3: New Zealand NZ New Zealand
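Note that the vectorized call above returns the rows that match any of the words, whereas the question asks for rows that match all of them. One possible way to get the "all words" behaviour without an explicit loop is to intersect the per-word index vectors. This is only a sketch in base R on a plain character vector, with a different second word chosen so that the result is non-empty:

```r
search <- c("Brisbane AU Australia", "Sydney AU Australia",
            "Auckland NZ New Zealand", "NZ New Zealand", "AU Australia")
words <- c("Bisbane", "Australia")  # note the mis-spelling

# one integer index vector per word (approximate matches)
per_word <- lapply(words, agrep, x = search)

# keep only the positions that appear in every per-word result
all_idx <- Reduce(intersect, per_word)
print(search[all_idx])
```

Inside a data.table the same expression should work in i, for example places[Reduce(intersect, lapply(words, agrep, x = search))]. Unlike the successive-refinement loop, this searches the whole column once per word.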

Related

Using R, How to use a character vector to search for matches in a very large character vector

I have a vector of city names:
Cities <- c("New York", "San Francisco", "Austin")
I want to use it to find records in a 1,000,000+ element column of city/state names, contained in a bigger table, that match any of the items in the Cities vector:
Locations <- c("San Antonio/TX", "Austin/TX", "Boston/MA")
I tried using lapply and grep, but it kept saying it can’t use an input vector dimension larger than 1.
Ideally, I want to return the row positions in the Locations vector that contain any item in the Cities vector, which will allow me to select matching rows in the broader table.
grep and family only allow a single pattern= in their call, but one can use Vectorize to help with this:
out <- Vectorize(grepl, vectorize.args = "pattern")(Cities, Locations)
rownames(out) <- Locations
out
# New York San Francisco Austin
# San Antonio/TX FALSE FALSE FALSE
# Austin/TX FALSE FALSE TRUE
# Boston/MA FALSE FALSE FALSE
(I added rownames(.) purely to identify columns/rows from the source data.)
With this, if you want to know which index points where, then you can do
apply(out, 1, function(z) which(z)[1])
# San Antonio/TX Austin/TX Boston/MA
# NA 3 NA
apply(out, 2, function(z) which(z)[1])
# New York San Francisco Austin
# NA NA 2
The first indicates the index within Cities that applies to each specific location. The second indicates the index within Locations that applies to each of Cities. Both of these methods assume that there is at most a 1-to-1 matching; if there are ever more, which(z)[1] will hide the 2nd and subsequent matches, which is likely not a good thing.
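If all that is needed is the row positions in Locations that contain any of the cities, the per-city results can be combined with a logical OR instead of picking a single index. A minimal sketch, assuming the city names should be treated as literal text rather than regular expressions (hence fixed = TRUE):

```r
Cities <- c("New York", "San Francisco", "Austin")
Locations <- c("San Antonio/TX", "Austin/TX", "Boston/MA")

# one logical vector per city, then OR them together element-wise
per_city <- lapply(Cities, grepl, x = Locations, fixed = TRUE)
any_match <- Reduce(`|`, per_city)

# row positions in Locations that contain at least one city
matching_rows <- which(any_match)
print(matching_rows)
```

Because the result is a single logical vector, it does not hide anything when a location matches more than one city.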

Make only numeric entries blank

I have a dataframe with UK postcodes in it. Unfortunately some of the postcode data is incorrect, i.e. the entries are only numeric (all UK postcodes should start with an alphabetic character).
I have done some research and found the grepl command, which I've used to generate a TRUE/FALSE vector if the entry is only numeric:
Data$NewPostCode <- grepl("^.*[0-9]+[A-Za-z]+.*$|.*[A-Za-z]+[0-9]+.*$",Data$PostCode)
However, what I really want to do is make the postcode blank wherever it starts with a number.
Note, I don't want to remove the rows with an incorrect postcode, as I will lose information from the other variables. I simply want to remove that postcode.
Example data
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton 1254
London 1290C
Newcastle N1 3DC
Desired output
Area Postcode
Birmingham B1 1AA
Manchester M1 2BB
Bristol BS1 1LM
Southampton
London
Newcastle N1 3DC
There are a few ways to go between TRUE/FALSE vectors and the kind of task you want, but I prefer ifelse. A simpler way to generate the type of logical vector you're looking for is
grepl("^[0-9]", Data$PostCode)
which will be TRUE whenever PostCode starts with a number, and FALSE otherwise. You may need to adjust the regex if your needs are more complex.
You can then define a new column which is blank whenever the vector is TRUE and the old value whenever the vector is FALSE, as follows:
Data$NewPostCode <- ifelse(grepl("^[0-9]", Data$PostCode), "", Data$PostCode)
(May I suggest using NA instead of blank?)
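For the NA variant, a short sketch using the example data from the question could look like this. It assumes PostCode is stored as character rather than factor, which is why stringsAsFactors = FALSE is set explicitly:

```r
Data <- data.frame(
  Area = c("Birmingham", "Manchester", "Bristol",
           "Southampton", "London", "Newcastle"),
  PostCode = c("B1 1AA", "M1 2BB", "BS1 1LM", "1254", "1290C", "N1 3DC"),
  stringsAsFactors = FALSE
)

# TRUE where the postcode starts with a digit
starts_with_digit <- grepl("^[0-9]", Data$PostCode)

# copy the column, then overwrite only the invalid entries with NA
Data$NewPostCode <- Data$PostCode
Data$NewPostCode[starts_with_digit] <- NA
print(Data)
```

Assigning into the subset keeps the column as character, whereas ifelse on a factor column would return the underlying integer codes.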

Geocoding in R using googleway

I have read Batch Geocoding with googleway R
I am attempting to geocode some addresses using googleway. I want the geocodes, address, and county returned back.
Using the answer linked to above I created the following function.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
df<-as.data.frame(unlist(res[[x]]$results$address_components))
address<-paste(df[1,],df[2,],sep = " ")
city<-paste0(df[3,])
county<-paste0(df[4,])
state<-paste0(df[5,])
zip<-paste0(df[7,])
coordinates<-cbind(coordinates,address,city,county,state,zip)
coordinates<-as.data.frame(coordinates)
})
Then put it back together like so...
library(data.table)
done<-rbindlist(geocodes)
The issue is getting the address and county back out from the 'res' list. The answer linked to above pulls the address from the dataframe that was sent to google and assumes the list is in the right order and there are no multiple match results back from google (in my list there seems to be a couple). Point is, taking the addresses from one file and the coordinates from another seems rather reckless and since I need the county anyway, I need a way to pull it out of google's resulting list saved in 'res'.
The issue is that some addresses have more "types" than others which means referencing by row as I did above does not work.
I also tried including rbindlist inside the function to convert the sublist into a datatable and then pull out the fields but can't quite get it to work. The issue with this approach is that actual addresses are in a vector but the 'types' field which I would use to filter or select is in a sublist.
The best way I can describe it is like this -
list <- c(long address),c(short address), types(LIST(street number, route, county, etc.))
Obviously, I'm a beginner at this. I know there's a simpler way but I am just really struggling with lists and R seems to make extensive use of them.
Edit:
I definitely recognize that I cannot rbind the whole list. I need to pull specific elements out and bind just those. A big part of the problem, in my mind, is that I do not have a great handle on indexing and manipulating lists.
Here are some addresses to try - "301 Adams St, Friendship, WI 53934, USA" has a 7x3 "address components" matrix and a corresponding "types" list of 7. Compare that to "222 S Walnut St, Appleton, WI 45911, USA", which has a 9x3 address components matrix and a "types" list of 9. The types list needs to be connected back to the address components matrix because the types list identifies what each row of the address components matrix contains.
Then there are more complexities introduced by imperfect matches. Try "211 Grand Avenue, Rothschild, WI, 54474" and you get 2 lists, one for east grand ave and one for west grand ave. Google seems to prefer the east since that's what comes out in the "formatted address." I don't really care which is used since the county will be the same for either. The "location" interestingly contains 2 sets of geocodes which, presumably, refer to the two matches. I think this complexity can be ignored since the location consisting of two coordinates is still stored as a 'double' (not a list!) so it should stack with the coordinates for the other addresses.
Edit: This should really work but I'm getting an error in the do.call(rbind,types) line of the function.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
types<-res[[x]]$results$address_components[[1]]$types
types<-do.call(rbind,types)
types<-types[,1]
address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,stringsAsFactors=FALSE)
names(address)[1]<-"V2"
address<-cbind(address,types)
address<-tidyr::spread(address,types,V2)
address<-cbind(address,coordinates)
})
R says the "types" object is not a list so it can't rbind it. I tried coercing it to a list but still get the error. I checked using the following pared-down function and found that #294 is null. This halts the function. I get "over query limit" as an error but I am not over the query limit.
geocodes<-lapply(seq_along(res),function(x) {
types<-res[[x]]$results$address_components[[1]]$types
print(typeof(types))
})
Here's my solution using tidyverse functions. This gets the geocode and also the formatted address in case you want it (other components of the result can be returned as well; they just need to be added to the table that is returned in the last row of the map function).
suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(googleway))
set_key("your key here")
df <- tibble(full_address = c("2379 ADDISON BLVD HIGH POINT 27262",
"1751 W LEXINGTON AVE HIGH POINT 27262", "dljknbkjs"))
df %>%
mutate(geocode_result = map(full_address, function(full_address) {
res <- google_geocode(full_address)
if(res$status == "OK") {
geo <- geocode_coordinates(res) %>% as_tibble()
formatted_address <- geocode_address(res)
geocode <- bind_cols(geo, formatted_address = formatted_address)
}
else geocode <- tibble(lat = NA, lng = NA, formatted_address = NA)
return(geocode)
})) %>%
unnest()
#> # A tibble: 3 x 4
#> full_address lat lng formatted_address
#> <chr> <dbl> <dbl> <chr>
#> 1 2379 ADDISON BLVD HIGH POI… 36.0 -80.0 2379 Addison Blvd, High Point, N…
#> 2 1751 W LEXINGTON AVE HIGH … 36.0 -80.1 1751 W Lexington Ave, High Point…
#> 3 dljknbkjs NA NA <NA>
Created on 2019-04-14 by the reprex package (v0.2.1)
Ok, I'll answer it myself.
Begin with a dataframe of addresses. I called mine "addresses" and the single column in the dataframe is called "Address" (note that I capitalized it).
Use googleway to get the geocode data. I did this using apply to loop across the rows in the address dataframe
library(googleway)
res<-apply(addresses,1,function (x){
google_geocode(address=x[['Address']], key='insert your google api key here - its free to get')
})
Here is the function I wrote to get the nested lists into a dataframe.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
types<-res[[x]]$results$address_components[[1]]$types
types<-do.call(rbind,types)
types<-types[,1]
address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,stringsAsFactors=FALSE)
names(address)[1]<-"V2"
address<-cbind(address,types)
address<-tidyr::spread(address,types,V2)
address<-cbind(address,coordinates)
})
library(data.table)
geocodes<-rbindlist(geocodes,fill=TRUE)
lapply loops along the items in the list. Within the function I create a coordinates dataframe and put the geocodes there. I also wanted the other address components, particularly the county, so I also created the "types" dataframe, which identifies what the items in the address are. I cbind the address items with the types, then use spread from the tidyr package to reshape the dataframe into wide format so it's just 1 row. I then cbind in the lat and lng from the coordinates dataframe.
The rbindlist stacks it all back together. You could use do.call(rbind, geocodes) but rbindlist is faster.
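An alternative to reshaping is to look up each address component by its type instead of by row position. The sketch below uses a hand-made stand-in for one result's address_components (the values are placeholders, not real Google output) and a hypothetical helper, get_component, so it runs without an API key:

```r
# Mock of one result's address_components: a data frame with a `types` list column.
# All values here are illustrative placeholders.
components <- data.frame(
  long_name = c("301", "Example Street", "Example Town",
                "Example County", "Example State", "00000"),
  stringsAsFactors = FALSE
)
components$types <- list(
  "street_number",
  "route",
  c("locality", "political"),
  c("administrative_area_level_2", "political"),
  c("administrative_area_level_1", "political"),
  "postal_code"
)

# Hypothetical helper: first long_name whose types include `type`, else NA
get_component <- function(components, type) {
  hit <- vapply(components$types, function(t) type %in% t, logical(1))
  if (any(hit)) components$long_name[which(hit)[1]] else NA_character_
}

county <- get_component(components, "administrative_area_level_2")
zip <- get_component(components, "postal_code")
missing_part <- get_component(components, "subpremise")  # not present, gives NA
print(c(county = county, zip = zip))
```

With a real response, components would be res[[x]]$results$address_components[[1]], ideally after first checking that res[[x]]$status is "OK", so that empty results do not halt the loop.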

How to write a for-loop that searches names from data.frame in a character vector?

I have a data.frame with names of football players, for example:
names <- data.frame(id=c(1,2,3,4,5,6,7),
year=c('Maradona', 'Cruyff', 'Messi', 'Ronaldo', 'Pele', 'Van Basten', 'Diego'))
> names
id year
1 1 Maradona
2 2 Cruyff
3 3 Messi
4 4 Ronaldo
5 5 Pele
6 6 Van Basten
7 7 Diego
I also have 6,000 scraped text files, containing stories about these football players. These stories are stored as 6,000 elements in a large vector called stories.
Is there a way to write a loop (or an apply function) that searches for the names of each of the football players? If a match or multiple matches occur, I would like to record the element number and the name(s) of the football player.
For example, consider the following text in stories[1]:
Diego Armando Maradona (born 30 October 1960) is a retired Argentine
professional footballer. He has served as a manager and coach at other
clubs as well as the national team of Argentina. Many in the sport,
including football writers, former players, current players and
football fans, regard Maradona as the greatest football player of all
time. He was joint FIFA Player of the 20th Century
with Pele.
The ideal data.frame would have the following structure:
> outcome
element name1 name2
1 1 Maradona Pele
Does somebody know a way to write code that results in one data.frame with information on all football players?
I just did it with a loop, but maybe you can do it with an apply function
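# Make sure you include stringsAsFactors = FALSE or my code won't work
football_names <- data.frame(id = 1:7,
  year = c('Maradona', 'Cruyff', 'Messi', 'Ronaldo', 'Pele', 'Van Basten', 'Diego'), stringsAsFactors = FALSE)
# one row per story (not per player)
outcome <- data.frame(element = seq_along(stories))
for (i in seq_along(stories)) {
  # names that appear as whole space-separated words in story i
  # (multi-word names such as 'Van Basten' cannot match after splitting on spaces)
  names_in_story <- football_names$year[football_names$year %in% unlist(strsplit(stories[i], split = " "))]
  # seq_along() skips the inner loop when a story contains no names
  for (j in seq_along(names_in_story)) {
    outcome[i, j + 1] <- names_in_story[j]
  }
}
names(outcome) <- c("element", paste0("name", seq_len(ncol(outcome) - 1)))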
I don't understand your question exactly, but you can try a string match using a stringr function and lapply.
I assumed that your data stories is a list.
The function finds all the names you pass in as a vector and counts their occurrences. The output is again a list.
library(stringr)
foo <- function(x, y) table(unlist(str_match_all(x, paste0(y, collapse = "|"))))
The result
res <- lapply(stories, foo, names$year)
Then you can merge and sum up the data (rowSums()) for example like this:
Reduce(function(...) merge(..., all=T, by="Var1"), res)
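For comparison, here is a base R sketch that records which names occur in which story in long format (one row per story/name pair) rather than in name1, name2, ... columns. The two stories are shortened stand-ins for the real data, and names are matched as literal substrings, so multi-word names such as 'Van Basten' can be found too:

```r
player_names <- c("Maradona", "Cruyff", "Messi", "Ronaldo",
                  "Pele", "Van Basten", "Diego")
stories <- c(
  "Diego Armando Maradona was joint FIFA Player of the 20th Century with Pele.",
  "A story that mentions none of the players."
)

# for each story, keep the names that occur somewhere in its text
found <- lapply(stories, function(s) {
  player_names[vapply(player_names, grepl, logical(1), x = s, fixed = TRUE)]
})

# long format: the story index is repeated once per name found in it
outcome <- data.frame(
  element = rep(seq_along(stories), lengths(found)),
  name = unlist(found),
  stringsAsFactors = FALSE
)
print(outcome)
```

Stories without any match simply contribute no rows, and the long table can be reshaped to wide afterwards if the name1/name2 layout is required.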

R how to get data column to rows of first and second values

Apologies, I'm a novice but I don't seem to be able to find an answer to this question.
I've scraped tabular data from a web page. After some cleaning It appears in a single unnamed column.
[1] John
[2] Smith
[3] Tina
[4] Jordan
and so on.....
I'm obviously looking for the result of:
FirstName | LastName
[1] John Smith
[2] Tina Jordan
et al.
Much of what has gotten me to this point was sourced from: http://statistics.berkeley.edu/computing/r-reading-webpages
A very helpful resource for beginners such as myself.
I would be grateful for any advice you can give me.
Thanks,
C R Eaton
We create a logical index ('i1') and build a data.frame by extracting elements from the first column of the original dataset ('dat') using 'i1'. The 'i1' elements are recycled to the length of the column, so 'dat[i1, 1]' extracts the 1st, 3rd, 5th, etc. elements. For the last name, we just negate 'i1', so that it extracts the 2nd, 4th, etc.
i1 <- c(TRUE, FALSE)
d1 <- data.frame(FirstName = dat[i1,1], LastName = dat[!i1, 1], stringsAsFactors=FALSE)
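To make this concrete, here is a small self-contained run on the four values from the question. The column name V1 is an assumption, since the scraped column was unnamed. An equivalent matrix-based reshaping is shown alongside for comparison:

```r
dat <- data.frame(V1 = c("John", "Smith", "Tina", "Jordan"),
                  stringsAsFactors = FALSE)

# recycled logical index: TRUE picks odd positions, !i1 picks even ones
i1 <- c(TRUE, FALSE)
d1 <- data.frame(FirstName = dat[i1, 1], LastName = dat[!i1, 1],
                 stringsAsFactors = FALSE)

# same result by filling a two-column matrix row by row
m <- matrix(dat[[1]], ncol = 2, byrow = TRUE)
d2 <- data.frame(FirstName = m[, 1], LastName = m[, 2],
                 stringsAsFactors = FALSE)
print(d1)
```

Both versions assume the column has an even number of entries that strictly alternate between first and last names.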
