Creating a function in R

I'm trying to create code that looks at two CSV files: one is a world list of all bird species and their ranges, and the other is a file of all the birds in the Himalayas. I need to check each species in the Himalayas file against the matching species on the IOC world list and see if the bird is actually in range (meaning it would say either "India" or "himalayas" or "s e Asia" under the Range column). I want to create a function that takes both data sets as input, finds where the names match, checks whether the range contains those words, and returns the rows where it does NOT, so I can check those birds specifically. Here is what I have so far (I'm using RStudio):
myfunc <- function() {
  if ((bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name != ("Himalayas" | "se Asia" | "India"))) {
    print(eval(bird_data$Common.Name[(bird_data$Scientific.name == ioc$Scientific.name) & (ioc$Scientific.name != ("Himalayas" | "se Asia" | "India"))]))
  }
}
save("myfunc", file = "myfunc.Rdata")
source("myfunc.Rdata")
I think I'm messed up in not having inputs. So I'm trying a new approach with:
compare = function(data1, data2) {
....
}
But for the above, I don't know how to let the function recognize the appropriate subsets of data (like I can't say data1$Scientific.name).

It's difficult to answer this question without a minimal reproducible example: without any knowledge of the two dataframes you are comparing, it is hard to formulate a solution. See the link in the comment by alistaire above for how to provide this.
I suggest you change your question title to make it more informative. "Creating a function in R" suggests you want to know the syntax required for a function in R; I would recommend "Subsetting a dataframe with grep and then filtering results in R", which is what I think you are actually trying to do.
Assuming you obtained your IOC world list data from the International Ornithological Committee website, I am unsure whether the approach you describe in your function would work, as the data in the column Breeding Range-Subregion(s) is very messy. For example:
w Himalayas to s Siberia and w Mongolia
Himalayas to c China
e Afghanistan to nw India and w Nepal
e Afghanistan to w Tibetan plateau and n India
Africa south of the Sahara, s and se Asia
None of these values is identical to "India" or "himalayas" or "SE Asia", and none will be returned by your function, which looks for an exact match. You would need to use grep to find the substring present within your data.
Let's create a toy data set.
bird_data <- data.frame(
  Scientific.name = c(
    "Chicken Little",
    "Woodstock",
    "Woody Woodpecker",
    "Donald Duck",
    "Daffy Duck",
    "Big Bird",
    "Tweety Pie",
    "Foghorn Leghorn",
    "The Road Runner",
    "Angry Birds"))

ioc_data <- data.frame(
  Scientific.name = c(
    "Chicken Little",
    "Woodstock",
    "Woody Woodpecker",
    "Donald Duck",
    "Daffy Duck",
    "Big Bird",
    "Tweety Pie",
    "The Road Runner",
    "Angry Birds"),
  subrange = c(
    "Australia, New Zealand",
    "w Himalayas to s Siberia and w Mongolia",
    "Himalayas to c China",
    "e Afghanistan to nw India and w Nepal",
    "e Afghanistan to w Tibetan plateau and n India",
    "Africa south of the Sahara, s and se Asia",
    "Amazonia to n Argentina",
    "n Eurasia",
    "n North America"))
I would break what you are attempting to do into two steps.
Step 1
Use grep to subset the ioc_data dataframe based upon whether your search terms are found in the subrange column:
searchTerms <- c("India", "himalayas", "SE Asia")
#Then we use grep to return the indexes of matching rows:
matchingIndexes <- grep(paste(searchTerms, collapse = "|"),
                        ioc_data$subrange,
                        ignore.case = TRUE) #Important so a search for "SE Asia" also matches "se asia"
#We can then use our matching indexes to subset our ioc_data dataframe producing
#a subset of data corresponding to our range of interest:
ioc_data_subset <- ioc_data[matchingIndexes,]
Step 2
If I understand your question correctly, you now want to extract the rows from bird_data that ARE NOT present in ioc_data_subset (i.e. which rows in bird_data are for birds that are NOT recorded as inhabiting the subranges "India", "SE Asia", and "Himalayas" in the IOC data).
I would use Hadley Wickham's dplyr package for this - a good cheat sheet can be found here. After installing dplyr:
library(dplyr)
#Create a merged dataframe containing all the data in one place.
merged_data <- dplyr::left_join(bird_data,
                                ioc_data,
                                by = "Scientific.name")
#Use an anti_join to select any rows in merged_data that are NOT present in
#ioc_data_subset
results <- dplyr::anti_join(merged_data,
                            ioc_data_subset,
                            by = "Scientific.name")
The left_join is required first because otherwise we would not have the subrange column in our final dataframe. Note that any species in bird_data not in ioc_data will return NA in the subrange column to indicate no data was found.
results
Scientific.name subrange
1 Angry Birds n North America
2 The Road Runner n Eurasia
3 Foghorn Leghorn <NA>
4 Tweety Pie Amazonia to n Argentina
5 Chicken Little Australia, New Zealand
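To come back to the original question about function inputs: inside a function you can refer to the columns of an argument with $ exactly as you would normally, so data1$Scientific.name is perfectly legal. As a sketch only, assuming the toy column names used here, the two steps above could be wrapped up as:
compare <- function(data1, data2, terms = c("India", "himalayas", "SE Asia")) {
  #Step 1: subset data2 to the rows whose subrange matches any search term
  in_range <- data2[grep(paste(terms, collapse = "|"),
                         data2$subrange, ignore.case = TRUE), ]
  #Step 2: keep the rows of data1 whose species are NOT in that subset
  merged <- dplyr::left_join(data1, data2, by = "Scientific.name")
  dplyr::anti_join(merged, in_range, by = "Scientific.name")
}
compare(bird_data, ioc_data) #same output as results above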

Matching word list to individual words in strings using R

Edit: Fixed data example issue
Background/Data: I'm working on a merge between two datasets: one is a list of the legal names of various publicly traded companies and the second is a fairly dirty field with company names, individual titles, and all sorts of other difficult to predict words. The company name list is about 14,000 rows and the dirty data is about 1.3M rows. Not every publicly traded company will appear in the dirty data and some may appear multiple times with different presentations (Exxon Mobil, Exxon, ExxonMobil, etc.).
Accordingly, my current approach is to dismantle the publicly traded company name list into the individual words used in each title (after cleaning out some common words like company, corporation, inc, etc.), resulting in the data shown below as Have1. An example of some of the dirty data is shown below as Have2. I have also cleaned these strings to eliminate words like Inc and Company in my ongoing work, but in case anyone has a better idea than my current approach, I'm leaving the data as-is. Additionally, we can assume there are very few, if any, exact matches in the data and that the Have2 data is too noisy to successfully use a fuzzy match without additional work.
Question: What is the best way to go about determining which of the items in Have2 contains the words from Have1? Specifically, I think I need the final data to look like Want, so that I can then link the public company name to the dirty data name. The plan is to hand-verify the matches given the difficulty of the Have2 data, but if anyone has any suggestions on another way to go about this, I am definitely open to suggestions (please, someone, have a suggestion haha).
Tried so far: I have code that sort of works, but takes ages to run and seems inefficient. That is:
library(data.table)
library(stringr)
company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")
have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[!V1 == "inc"]
have2 <- c("ceo and director, apple inc",
"current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
"xerox inc., president and ceo",
"president and ceo of the amazon apple assn., division 4")
#Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))
#Creates container
store <- data.table()
#Loops through each of the Have1 company names and sees whether that word appears in the have3 matrix
for (i in 1:nrow(have1)) {
  matches <- data.table(have2[sapply(1:nrow(have3), function(x) any(grepl(paste0("\\b", have1$V1[i], "\\b"), have3[x, ])))])
  if (nrow(matches) == 0) {
    next
  }
  #Create combo data
  matches[, have1_word := have1$V1[i]]
  #Storage
  store <- rbind(store, matches)
}
Want

Name (from Have2)                                                                                                               Word (from Have1)
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy   amazon
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy   radiation
vp and general bird aficionado of the amazon apple assn. branch F                                                               amazon
vp and general bird aficionado of the amazon apple assn. branch F                                                               apple
ceo and director, apple inc                                                                                                     apple
xerox inc., president and ceo                                                                                                   xerox

Have1

Word              N
amazon            1
apple             3
xerox             1
notgoingtomatch   2
radiation         1

Have2

Name
ceo and director, apple inc
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy
xerox inc., president and ceo
vp and general bird aficionado of the amazon apple assn. branch F
Using what you have documented, in terms of data from company_name_data and have2 only:
library(tidytext)
library(tidyverse)
#------------ remove stop words before tokenization ---------------
# split each phrase into words, remove the stop words, then rejoin the phrase
# (lapply works through the phrases one element at a time)
comp2 <- unlist(lapply(company_name_data, function(x) {
  words <- unlist(strsplit(x, " "))
  paste(words[!words %in% stop_words$word], collapse = " ")
}))

haveItAll <- data.frame(have2)

# keep only the words of each dirty string that appear in the cleaned
# company-name vocabulary
haveItAll$comp <- unlist(lapply(have2, function(x) {
  words <- unlist(strsplit(x, " "))
  paste(words[words %in% comp2], collapse = " ")
}))
The results in the second column, based on the text analysis, are "apple", "radiation", "xerox", and "amazon apple".
I'm certain this code isn't mine originally. I'm sure I got these ideas from somewhere on StackOverflow...
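As an aside, and purely as a sketch using the have1/have2 objects built in the question: the Want-style Name/Word pairs can also be produced with one vectorized grepl() call per company word, avoiding the row-by-row sapply over have3.
library(data.table)
pairs <- rbindlist(lapply(have1$V1, function(w) {
  hit <- grepl(paste0("\\b", w, "\\b"), have2) # one pass over all of have2
  if (!any(hit)) return(NULL)                  # rbindlist drops the NULLs
  data.table(Name = have2[hit], Word = w)
}))
pairs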

How to reformat similar text for merging in R?

I am working with the NYC open data, and I want to merge two data frames based on community board. The issue is, the two data frames have slightly different ways of representing this. I have provided an example of the two different formats below.
CommunityBoards <- data.frame(
  FormatOne = c("01 BRONX", "05 QUEENS", "15 BROOKLYN", "03 STATEN ISLAND"),
  FormatTwo = c("BRONX COMMUNITY BOARD #1", "QUEENS COMMUNITY BOARD #5",
                "BROOKLYN COMMUNITY BOARD #15", "STATEN ISLAND COMMUNITY BD #3"))
Along with the issue of the placement of the numbers and the "#", the second data frame shortens "COMMUNITY BOARD" to "COMMUNITY BD" just for Staten Island. I don't have a strong preference for what the string looks like, so long as I can discern the borough and community board number. What would be the easiest way to reformat one or both of these strings so I could merge the two sets?
Thank you for any and all help!
You can use regex to get out just the district numbers. For the first format, the only thing that matters is the beginning of the string before the space, hence you could do
districtsNrs1 <- as.numeric(gsub("(\\d+) .*","\\1",CommunityBoards$FormatOne))
For the second, I assume that the formats look like "something #number", hence you could do
districtsNrs2 <- as.numeric(gsub(".* #(\\d+)","\\1",CommunityBoards$FormatTwo))
to get the pure district numbers.
Now you know how to extract the district numbers. With that information, you can name/reformat the district-names how you want.
To know which district number is which district, you can create a translation data.frame between the districts and numbers like
districtNumberTranslations <- data.frame(
  districtNumber = districtsNrs2,
  districtName = sapply(strsplit(CommunityBoards$FormatTwo, " COMMUNITY "), "[[", 1)
)
giving
# districtNumber districtName
#1 1 BRONX
#2 5 QUEENS
#3 15 BROOKLYN
#4 3 STATEN ISLAND
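To finish the merge itself, one possible sketch is to build the same normalized "number + borough" key from both formats and join on it:
key1 <- paste(districtsNrs1, sub("^\\d+ ", "", CommunityBoards$FormatOne))
key2 <- paste(districtsNrs2,
              sapply(strsplit(CommunityBoards$FormatTwo, " COMMUNITY "), "[[", 1))
all(key1 == key2) # the keys line up, so either data frame can carry this key as a merge column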

Categorizing types of duplicates in R

Let's say I have the following data frame:
df <- data.frame(
  address = c('654 Peachtree St', '890 River Rd', '890 River Rd', '890 River Rd',
              '1234 Main St', '1234 Main St', '567 1st Ave', '567 1st Ave'),
  city  = c('Atlanta', 'Eugene', 'Eugene', 'Eugene', 'Portland', 'Portland', 'Pittsburgh', 'Etna'),
  state = c('GA', 'OR', 'OR', 'OR', 'OR', 'OR', 'PA', 'PA'),
  zip5  = c('30308', '97404', '97404', '97404', '97201', '97201', '15223', '15223'),
  zip9  = c('30308-1929', '97404-3253', '97404-3253', '97404-3253',
            '97201-5717', '97201-5000', '15223-2105', '15223-2105'),
  stringsAsFactors = FALSE)
           address       city state  zip5       zip9
1 654 Peachtree St    Atlanta    GA 30308 30308-1929
2     890 River Rd     Eugene    OR 97404 97404-3253
3     890 River Rd     Eugene    OR 97404 97404-3253
4     890 River Rd     Eugene    OR 97404 97404-3253
5     1234 Main St   Portland    OR 97201 97201-5717
6     1234 Main St   Portland    OR 97201 97201-5000
7      567 1st Ave Pittsburgh    PA 15223 15223-2105
8      567 1st Ave       Etna    PA 15223 15223-2105
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
       address       city state  zip5       zip9           type
1 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
2 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
3 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
4 1234 Main St   Portland    OR 97201 97201-5717 Different Zip9
5 1234 Main St   Portland    OR 97201 97201-5000 Different Zip9
6  567 1st Ave Pittsburgh    PA 15223 15223-2105 Different City
7  567 1st Ave       Etna    PA 15223 15223-2105 Different City
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact match rows should show true only on the "exact match" column, and false for every other column. If I define my columns simply by feeding different combinations of columns to group_by, the exact match rows will never return a False.
I think the key is grouping by a "reference" variable (here, address makes sense) and then counting the number of unique items in each group. It's not a perfect solution, since case_when prioritizes the earlier options: if one address has two different cities attributed to it AND two different zip codes, you'll only see that there are two different cities, and you would need additional case_when statements to address this if it matters. However, counting unique items is a reasonable heuristic if you don't need a perfectly granular solution.
df %>%
group_by(address) %>%
mutate(
match_type = case_when(
all(
length(unique(city)) == 1,
length(unique(state)) == 1,
length(unique(zip5)) == 1,
length(unique(zip9)) == 1) ~ "Exact Match",
length(unique(city)) > 1 ~ "Different City",
length(unique(state)) > 1 ~ "Different State",
length(unique(zip5)) > 1 ~ "Different Zip5",
length(unique(zip9)) > 1 ~ "Different Zip9"
))
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
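For example, a minimal sketch of that Boolean-column version (the flag names are just illustrative):
library(dplyr)

df %>%
  group_by(address, zip5) %>% # your definition of a duplicate set
  mutate(
    is_dup        = n() > 1,
    diff_city_dup = is_dup & n_distinct(city) > 1,
    diff_zip9_dup = is_dup & n_distinct(zip9) > 1
  ) %>%
  ungroup()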
Edit
One additional approach I just thought of, if you need a more granular solution, is to add an id column (df %>% rowid_to_column("ID")) and then full join the table to itself by address with suffixes (e.g. suffix = c("a","b")), filtering out rows where the two IDs match and calling distinct (since each comparison appears twice). You can then use mutate to build Boolean columns for the pairwise comparisons. It may be too computationally intensive depending on the size of your dataset, but it should work on the scale of a few thousand rows if you have a reasonable amount of RAM.
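A rough sketch of that self-join idea, assuming dplyr and tibble (the _a/_b suffixes are illustrative):
library(dplyr)
library(tibble)

df_id <- df %>% rowid_to_column("ID")
pairs <- df_id %>%
  full_join(df_id, by = "address", suffix = c("_a", "_b")) %>%
  filter(ID_a < ID_b) %>% # drops self-matches and keeps each comparison once
  mutate(same_city = city_a == city_b,
         same_zip9 = zip9_a == zip9_b)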

Categorizing a data frame column in R using grep on a list of a list

I have a column of a data frame that I want to categorize.
> df$orgName
[1] "Hank Rubber" "United Steel of Chicago"
[3] "Muddy Lakes Solar" "West cable"
I want to categorize the column using the categories list below that contains a list of subcategories.
metallurgy <- c('steel', 'iron', 'mining', 'aluminum', 'metal', 'copper' ,'geolog')
energy <- c('petroleum', 'coal', 'oil', 'power', 'petrol', 'solar', 'nuclear')
plastics <- c('plastic', 'rubber')
wiring <- c('wire', 'cable')
categories = list(metallurgy, energy, plastics, wiring)
So far I've been able to use a series of nested ifelse statements to categorize the column as shown below, but the number of categories and subcategories keeps increasing.
df$commSector <-
  ifelse(grepl(paste(metallurgy, collapse = "|"), df$orgName, ignore.case = TRUE), 'metallurgy',
  ifelse(grepl(paste(energy, collapse = "|"), df$orgName, ignore.case = TRUE), 'energy',
  ifelse(grepl(paste(plastics, collapse = "|"), df$orgName, ignore.case = TRUE), 'plastics',
  ifelse(grepl(paste(wiring, collapse = "|"), df$orgName, ignore.case = TRUE), 'wiring', ''))))
I've thought about using a set of nested lapply statements, but I'm not too sure how to execute it.
Lastly, does anyone know of any R libraries that may have functions to do this?
Thanks a lot for everyone's time.
Cheers.
One option would be to get the vectors as a named list using mget, paste the elements together (as shown by the OP), and use grep to return the indexes of the elements in 'orgName' that match (or use value = TRUE to extract those elements directly). Then stack the result to create a data.frame.
res <- setNames(stack(lapply(mget(c("metallurgy", "energy", "plastics", "wiring")),
function(x) df$orgName[grep(paste(x, collapse="|"),
tolower(df$orgName))])), c("orgName", "commSector"))
res
# orgName commSector
#1 United Steel of Chicago metallurgy
#2 Muddy Lakes Solar energy
#3 Hank Rubber plastics
#4 West cable wiring
If we have other columns in 'df', do a merge
merge(df, res, by = "orgName")
# orgName commSector
#1 Hank Rubber plastics
#2 Muddy Lakes Solar energy
#3 United Steel of Chicago metallurgy
#4 West cable wiring
data
df <- data.frame(orgName = c("Hank Rubber", "United Steel of Chicago",
"Muddy Lakes Solar", "West cable"), stringsAsFactors=FALSE)

`data.table` way to select subsets based on `agrep`?

I'm trying to convert from data.frame to data.table, and need some advice on some logical indexing I am trying to do on a single column. Here is a table I have:
places <- data.table(name=c('Brisbane', 'Sydney', 'Auckland',
'New Zealand', 'Australia'),
search=c('Brisbane AU Australia',
'Sydney AU Australia',
'Auckland NZ New Zealand',
'NZ New Zealand',
'AU Australia'))
# name search
# 1: Brisbane Brisbane AU Australia
# 2: Sydney Sydney AU Australia
# 3: Auckland Auckland NZ New Zealand
# 4: New Zealand NZ New Zealand
# 5: Australia AU Australia
setkey(places, search)
I want to extract rows whose search column matches all words in a list, like so:
words <- c('AU', 'Brisbane')
hits <- places
for (w in words) {
hits <- hits[search %like% w]
}
# I end up with the 'Brisbane AU Australia' row.
I have one question:
Is there a more data.table way to do this? It seems to me that overwriting hits each time is a data.frame way of doing things.
This is subject to the caveat that I eventually want to use agrep rather than grep/%like%:
words <- c('AU', 'Bisbane') # note the mis-spelling
hits <- places
for (w in words) {
hits <- hits[agrep(w, search)]
}
I feel like this doesn't quite take advantage of data.table's capabilities and would appreciate thoughts on how to modify the code so it does.
EDIT
I want the for loop because places is quite large, and I only want to find rows that match all the words. Hence I only need to search within the results for the previous word when searching for the next word (that is, successively refine the results).
With the talk of "binary scan" vs "vector scan" in the data.table introduction (i.e. the "bad way" is DT[DT$x == "R" & DT$y == "h"], the "good way" is setkey(DT, x, y); DT[J("R", "h")]), I just wondered if there was some way I could apply this approach here.
Mathematical.coffee, as I mentioned in the comments, you cannot "partial match" by setting a column (or more columns) as key column(s). That is, in the data.table places, you've set the column "search" as the key column. Here, you can fast subset using data.table's binary search (as opposed to vector scan subsetting) by doing:
places["Brisbane AU Australia"] # binary search when "search" column is key'd
# is faster compared to:
places[search == "Brisbane AU Australia"] # vector scan
But in your case, you require:
places["AU"]
to give all rows that have a partial match of "AU" within the key column, and this is not possible (while it would certainly be a very interesting feature to have).
If the substring you're searching for does not itself contain mismatches, then you can try splitting the search strings into separate columns. That is, if the column search is split into three columns containing Brisbane, AU and Australia, then you can set the key of the data.table to the columns that contain AU and Brisbane. Then you can query the way you mention as:
# fast subset, AU and Brisbane are entries of the two key columns
places[J("AU", "Brisbane")]
You can vectorize the agrep function to avoid looping. Note that the result of agrep2 is a list, hence the unlist call:
words <- c("Bisbane", "NZ")
agrep2 <- Vectorize(agrep, vectorize.args = "pattern")
places[unlist(agrep2(words, search))]
## name search
## 1: Brisbane Brisbane AU Australia
## 2: Auckland Auckland NZ New Zealand
## 3: New Zealand NZ New Zealand
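Note that unlist() takes the union of the per-pattern matches, whereas the original loop required every word to match. If you need that behaviour, a small variant (a sketch) intersects the index vectors instead:
words <- c("Bisbane", "AU") # both terms must (approximately) match
places[Reduce(intersect, agrep2(words, search))]
##        name                search
## 1: Brisbane Brisbane AU Australia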
