Fuzzyjoin match based on two different columns instead of one? - r

I would like to ask a question regarding fuzzyjoin package. I am very new to R, and I promise I have read through the readme file and followed through examples on https://cran.r-project.org/web/packages/fuzzyjoin/index.html before I asked this question.
I have a list of vernacular names which I want to match with plant species names. A simplified version of my data looks like the example below. data1 has a LocalName column containing many misspelled vernacular names. data2 is the reference table with the correct local names and species that the matching should be based on.
data1 <- data.frame(Item=1:5, LocalName=c("BACTERIA F", "BAHIA", "BAIKEA", "BAIKIA", "BAIKIAEA SP"))
data1
Item LocalName
1 1 BACTERIA F
2 2 BAHIA
3 3 BAIKEA
4 4 BAIKIA
5 5 BAIKIAEA SP
data2 <- data.frame(LocalName=c("ENGOKOM","BAHIA","BAIKIA","BANANIER","BALANITES"), Species=c("Barteria fistulosa","Mitragyna spp","Baikiaea spp", "Musa spp", "Balanites wilsoniana"))
data2
LocalName Species
1 ENGOKOM Barteria fistulosa
2 BAHIA Mitragyna spp
3 BAIKIA Baikiaea spp
4 BANANIER Musa spp
5 BALANITES Balanites wilsoniana
I tried using the stringdist_left_join function, and it managed to match many species correctly. I am being conservative by setting max_dist=1 because in my list, many vernacular names are very similar.
library(dplyr)      # provides the %>% pipe
library(fuzzyjoin)
table <- data1 %>%
  stringdist_left_join(data2, by = c(LocalName = "LocalName"), max_dist = 1)
table
Item LocalName.x LocalName.y Species
1 1 BACTERIA F <NA> <NA>
2 2 BAHIA BAHIA Mitragyna spp
3 3 BAIKEA BAIKIA Baikiaea spp
4 4 BAIKIA BAIKIA Baikiaea spp
5 5 BAIKIAEA SP <NA> <NA>
However, I have one question. As you can see from data1, Item 5 (BAIKIAEA SP) actually matches the Species column of data2 rather than LocalName. I have many entries like this, where the LocalName in data1 is a typo of either a vernacular name or a species name, but I am not sure how to make stringdist_left_join match one column of data1 against two columns of data2. I tried modifying the code into something like this:
table <- data1%>%
stringdist_left_join(data2, by=c(LocalName="LocalName"|"Species"), max_dist=1)
but it did not work, citing "Error in "LocalName" | "Species" :
operations are possible only for numeric, logical or complex types". Does anyone know whether such matching is possible? Thanks in advance!
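One possible approach, offered as a sketch rather than a definitive answer: stack the LocalName and Species columns of data2 into a single lookup column and fuzzy-join against that. The names lookup, Key, and matched are made up for illustration, and ignore_case = TRUE is an assumption added here because the species names in data2 are not upper case.

```r
library(fuzzyjoin)

data1 <- data.frame(
  Item = 1:5,
  LocalName = c("BACTERIA F", "BAHIA", "BAIKEA", "BAIKIA", "BAIKIAEA SP"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  LocalName = c("ENGOKOM", "BAHIA", "BAIKIA", "BANANIER", "BALANITES"),
  Species = c("Barteria fistulosa", "Mitragyna spp", "Baikiaea spp",
              "Musa spp", "Balanites wilsoniana"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: every local name AND every species name
# becomes a candidate key that points to the species.
lookup <- rbind(
  data.frame(Key = data2$LocalName, Species = data2$Species,
             stringsAsFactors = FALSE),
  data.frame(Key = data2$Species, Species = data2$Species,
             stringsAsFactors = FALSE)
)

# Same conservative max_dist = 1 as in the question; case is ignored
# (an assumption) so "BAIKIAEA SP" can be compared with "Baikiaea spp".
matched <- stringdist_left_join(
  data1, lookup,
  by = c(LocalName = "Key"),
  max_dist = 1,
  ignore_case = TRUE
)
matched
```

With this sample data, Item 5 should pick up Baikiaea spp through the species key while Item 1 stays unmatched. On a larger list, a name that is close to both a local name and a species name would produce more than one row, so duplicates may need to be resolved afterwards.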

Related

Merge two data frames - no unique identifier

I would like to combine two data frames. One is information for birds banded. The other is information on recovered banded birds. I would like to add the recovery data to the banding data, if the bird was recovered (not all birds were recovered). Unfortunately the full band number is not included in the banding data, only in the recovery data, so there is not a unique column to join them by.
One looks like this:
GISBLong   GISBLat   B Flyway  B Month  B Year  Band Prefix Plus
-85.41667  42.41667  8         5        2001    12456
-85.41655  36.0833   9         6        2003    21548
The other looks like this:
GISBLong   GISBLat   B Flyway  B Month  B Year  Band       R Month  R Year
-85.41667  42.41667  8         5        2001    124565482  12       2002
-85.41655  36.0833   9         6        2003    215486256  1        2004
I have tried merge, ifelse, and the dplyr joins with no luck. Any suggestions? Thanks in advance!
You should look up rbind(). That might do the trick. For it to work, both data frames must have the same columns. I'd suggest adding the missing columns to your first data frame with dplyr::mutate() and removing any unneeded rows afterwards.
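A different technique from rbind(), sketched here only as an alternative: join on the columns the two tables share. In the sample rows, Band Prefix Plus looks like the first five digits of Band, so this sketch assumes that holds for the whole data set. The data frame names banding, recovery, and combined are invented for illustration, and the column names are written with dots in place of spaces.

```r
# Sample rows copied from the question
banding <- data.frame(
  GISBLong = c(-85.41667, -85.41655),
  GISBLat = c(42.41667, 36.0833),
  B.Flyway = c(8, 9),
  B.Month = c(5, 6),
  B.Year = c(2001, 2003),
  Band.Prefix.Plus = c("12456", "21548"),
  stringsAsFactors = FALSE
)
recovery <- data.frame(
  GISBLong = c(-85.41667, -85.41655),
  GISBLat = c(42.41667, 36.0833),
  B.Flyway = c(8, 9),
  B.Month = c(5, 6),
  B.Year = c(2001, 2003),
  Band = c("124565482", "215486256"),
  R.Month = c(12, 1),
  R.Year = c(2002, 2004),
  stringsAsFactors = FALSE
)

# Assumption: the prefix is the first five digits of the full band number
recovery$Band.Prefix.Plus <- substr(recovery$Band, 1, 5)

# Left join: all.x = TRUE keeps banded birds that were never recovered
combined <- merge(
  banding, recovery,
  by = c("GISBLong", "GISBLat", "B.Flyway", "B.Month", "B.Year",
         "Band.Prefix.Plus"),
  all.x = TRUE
)
combined
```

If several birds share the same location, date, and prefix, this join cannot tell them apart and will pair each banding row with every candidate recovery row, so the result would need checking for duplicates.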

Text-mining including patterns and numbers

Dataset contains a free text field with information on building plans. I need to split the content of the field in 2 parts, the first part contains only the number of planned buildings, the other only the type of building. I have a reference lexicon list with the types of buildings.
Example
Plans<- c("build 10 houses ","5 luxury apartments with sea view",
"renovate 20 cottages"," transform 2 bungalows and a school", "1 hotel")
Reference list
Types <-c("houses", "cottages", "bungalows", "luxury apartments")
Desired output: 2 columns, Number and Type, with this content:
Number Type
10 houses
5 apartments
20 cottages
2 bungalows
Tried
matches <- unique (grep(paste(Types,collapse="|"), Plans, value=TRUE))
I can match the plans and types, but I can’t extract the numbers and types into two columns.
I tried str_split_fixed and grepl using :digit: and :alpha: but it isn’t working.
Assuming there is only one numeric part in each string, we can extract it by replacing all alphabetic characters with empty strings. We create the Type column by extracting whichever of the Types strings is present in Plans.
library(stringr)
data.frame(Number = as.numeric(gsub("[[:alpha:]]", "", Plans)),
Type = str_extract(Plans, paste(Types,collapse="|")))
# Number Type
#1 10 houses
#2 5 luxury apartments
#3 20 cottages
#4 2 bungalows
#5 1 <NA>
For the 5th row, "hotel" is not present in Types, so the output is NA. If you need to drop such cases, you can filter them out with is.na. The number-extraction step is taken from here.
You can also use strcapture from base R:
strcapture(pattern = paste0("(\\d+)\\s(",paste(Types,collapse="|"),")"),x = Plans,
proto = data.frame(Number=numeric(),Type=character()))
Number Type
1 10 houses
2 5 luxury apartments
3 20 cottages
4 2 bungalows
5 NA <NA>
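To get exactly the four rows in the desired output, the unmatched row can be dropped with is.na, as mentioned above. A minimal sketch that reuses the strcapture call; the name result is just for illustration:

```r
Plans <- c("build 10 houses ", "5 luxury apartments with sea view",
           "renovate 20 cottages", " transform 2 bungalows and a school",
           "1 hotel")
Types <- c("houses", "cottages", "bungalows", "luxury apartments")

# Capture "<number> <type>" where <type> is one of the reference terms
result <- strcapture(
  pattern = paste0("(\\d+)\\s(", paste(Types, collapse = "|"), ")"),
  x = Plans,
  proto = data.frame(Number = numeric(), Type = character(),
                     stringsAsFactors = FALSE)
)

# Drop plans whose building type is not in the reference list
result <- result[!is.na(result$Type), ]
result
```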

How do I replace values in an R dataframe column with a corresponding value?

Ok, so I have a dataframe that I downloaded from Pew Research Center. One of the columns (called 'cregion') contains a series of numbers from 1-56, with each number corresponding to a geographic location in the U.S. Most of these locations are states, and the additional 6 are at the sub-state level. So, for example, the number '1' corresponds to 'Alabama', and '11' corresponds to the 'District Of Columbia'.
What I'd like to do is replace each of those numbers in the 'cregion' column with the ACTUAL name of the region it corresponds to. Unfortunately, there is no column in this data frame that I can use to swap the values, as the key for which number corresponds to which region exists completely separately (word document). I'm new to R and while I've been searching for a few hours for the best way to go about this, I can't seem to find a method that would work (or I just don't understand the explanation). Can anybody suggest a method to me?
If you have a vector of the state names as strings called statevec whose ith element corresponds to cregion i, and your data frame is named dat, just do
dat <- data.frame(cregion = sample(1:50), stuff = runif(50))
head(dat)
# cregion stuff
#1 25 0.665843896
#2 11 0.144631131
#3 13 0.691616240
#4 28 0.507454243
#5 9 0.416535139
#6 30 0.004196311
statevec <- state.name
dat$cregion <- statevec[dat$cregion]
head(dat)
# cregion stuff
#1 Missouri 0.665843896
#2 Hawaii 0.144631131
#3 Illinois 0.691616240
#4 Nevada 0.507454243
#5 Florida 0.416535139
#6 New Jersey 0.004196311
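If the codes do not line up with positions in a vector (the question says 11 is the District Of Columbia, which state.name does not contain), a small key table plus match() may be safer. This is only a sketch: the key below holds just the two pairs given in the question, and the names key, code, and region are hypothetical. The full key would have to be typed in from the Word document.

```r
# Hypothetical key table; only the two pairs mentioned in the question
key <- data.frame(
  code = c(1, 11),
  region = c("Alabama", "District Of Columbia"),
  stringsAsFactors = FALSE
)

dat <- data.frame(cregion = c(11, 1, 1, 11))

# match() finds the row of key for each code; codes missing from the key
# become NA instead of silently picking up the wrong name
dat$cregion <- key$region[match(dat$cregion, key$code)]
dat
```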

Consolidate data table factor levels in R

Suppose I have a very large data table, one column of which is "ManufacturerName". The data was not entered uniformly, so it's pretty messy. For example, there may be observations like:
ABC Inc
ABC, Inc
ABC Incorporated
A.B.C.
...
Joe Shmos Plumbing
Joe Shmo Plumbing
...
I am looking for an automated way in R to try and consider similar names as one factor level. I have learned the syntax to manually do this, for example:
levels(df$ManufacturerName) <- list(ABC=c("ABC", "A.B.C", ....), JoeShmoPlumbing=c(...))
But I'm trying to think of an automated solution. Obviously it's not going to be perfect as I can't anticipate every type of permutation in the data table. But maybe something that searches the factor levels, strips out punctuation/special characters, and creates levels based on common first words. Or any other ideas. Thanks!
Look into the stringdist package. For starters, you could do something like this:
library(stringdist)
x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.", "Joe Shmos Plumbing", "Joe Shmo Plumbing")
d <- stringdistmatrix(x)
# 1 2 3 4 5
# 2 1
# 3 9 10
# 4 6 7 15
# 5 16 16 16 18
# 6 15 15 15 17 1
For more help, see ?stringdistmatrix or do searches on StackOverflow for fuzzy matching, approximate string matching, string distance functions, and agrep.
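One way to turn such a distance matrix into consolidated levels, sketched under assumptions, is hierarchical clustering with a cut-off. The cut height of 2 is an arbitrary choice for this small example and would need tuning on real data; groups and canonical are illustrative names.

```r
library(stringdist)

x <- c("ABC Inc", "ABC, Inc", "ABC Incorporated", "A.B.C.",
       "Joe Shmos Plumbing", "Joe Shmo Plumbing")

# With a single argument, stringdistmatrix() returns a dist object,
# which hclust() accepts directly
d <- stringdistmatrix(x)
hc <- hclust(d)

# Names whose cluster merges below height 2 end up in the same group
groups <- cutree(hc, h = 2)

# Use the first name seen in each group as that group's label:
# match(groups, groups) gives the index of each group's first member
canonical <- x[match(groups, groups)]
data.frame(original = x, group = groups, canonical = canonical)
```

With this threshold only the near-identical spellings are merged; "ABC Incorporated" and "A.B.C." stay separate, which shows why stripping punctuation or common suffixes first, as the question suggests, may still be worthwhile.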

using subset with a string in R

I have the following data frame in R (made up stuff to learn the program):
country population civilised
1 Town 13 5
2 city 69 9
3 Home 24 2
4 Stuff 99 9
and I am trying to access specific rows with the subset function, like
test <- subset(t, country==Town). But all I ever get is "object not found".
We need to quote the string.
test <- subset(t, country=='Town')
test
# country population civilised
#1 Town 13 5
NOTE: t is a function name (check ?t). It is better to give objects names that are not already function names.
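Putting both points together, a small self-contained sketch with the data frame renamed to places (a made-up name) and the equivalent bracket-indexing form alongside:

```r
# Data from the question, stored under a name that is not a base function
places <- data.frame(
  country = c("Town", "city", "Home", "Stuff"),
  population = c(13, 69, 24, 99),
  civilised = c(5, 9, 2, 9),
  stringsAsFactors = FALSE
)

# Quoted string: compared with the values in the country column
test <- subset(places, country == "Town")

# Equivalent bracket indexing
same <- places[places$country == "Town", ]
test
```

Without the quotes, R looks for an object called Town, which is why the original call reports that the object is not found.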

Resources