Replacing strings using fuzzywuzzyR - r

I have a large data set with city names. Many of the names are not consistent.
Example:
vec = c("New York", "New York City", "new York CIty", "NY", "Berlin", "BERLIn", "BERLIN", "London", "LONDEN", "Lond", "LONDON")
I want to use fuzzywuzzyR to bring them into a consistent format. The problem is that I have no master list of the original city names.
The package can detect duplicates like this:
library(fuzzywuzzyR)
init_proc = FuzzUtils$new()
PROC = init_proc$Full_process
init_scor = FuzzMatcher$new()
SCOR = init_scor$WRATIO
init = FuzzExtract$new()
init$Dedupe(contains_dupes = vec, threshold = 70L, scorer = SCOR)
dict_keys(['New York City', 'NY', 'BERLIN', 'LONDEN'])
Or I can set a "master value" like this:
master = "London"
init$Extract(string = master, sequence_strings = vec, processor = PROC, scorer = SCOR)
[[1]]
[[1]][[1]]
[1] "London"
[[1]][[2]]
[1] 100
[[2]]
[[2]][[1]]
[1] "LONDON"
[[2]][[2]]
[1] 100
[[3]]
[[3]][[1]]
[1] "Lond"
[[3]][[2]]
[1] 90
[[4]]
[[4]][[1]]
[1] "LONDEN"
[[4]][[2]]
[1] 83
[[5]]
[[5]][[1]]
[1] "NY"
[[5]][[2]]
[1] 45
My question is: how can I use this to replace all matches in the list with the same value, i.e. replace all values that match the master value with "London"? However, I don't have the master values, so I first need to build a list of matches and then replace the values. In this case the masters would be "New York", "London", "Berlin". After the process, vec should look like this:
new_vec = c("New York", "New York", "New York", "New York", "Berlin", "Berlin", "Berlin", "London", "London", "London", "London")
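One way to get there (a sketch only, not from the original post): treat the keys returned by Dedupe() as provisional master values, then run Extract() against each key and overwrite every element whose score clears a cutoff. The conversion of the Python dict_keys object via reticulate::py_to_r() and the cutoff of 70 are assumptions on my part:
library(fuzzywuzzyR)
init_scor = FuzzMatcher$new()
SCOR = init_scor$WRATIO
init = FuzzExtract$new()
# Provisional master values; Dedupe() returns a Python dict_keys object,
# so it may need an explicit conversion to an R character vector
keys = as.character(reticulate::py_to_r(
  init$Dedupe(contains_dupes = vec, threshold = 70L, scorer = SCOR)))
new_vec = vec
for (key in keys) {
  matches = init$Extract(string = key, sequence_strings = vec, scorer = SCOR)
  for (m in matches) {
    if (m[[2]] >= 70) {                # assumed score cutoff
      new_vec[vec == m[[1]]] = key     # overwrite every occurrence of the match
    }
  }
}
new_vec
Note that the keys are whatever spellings Dedupe() happens to keep ("LONDEN", "NY", ...), so a final manual mapping from each key to the preferred spelling is still needed.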
Update
@camille came up with the idea of using world.cities from the maps package. I found this post using fuzzyjoin that deals with a similar problem.
To use this I convert vec to a data frame.
vec = as.data.frame(vec, stringsAsFactors = F)
colnames(vec) = c("City")
Then I use the fuzzyjoin package together with world.cities from the maps package.
library(maps)
library(fuzzyjoin)
vec %>%
  stringdist_left_join(world.cities, by = c(City = "name"), distance_col = "d") %>%
  group_by(City) %>%
  top_n(1)
The output looks like this:
# A tibble: 50 x 3
# Groups: City [5]
City name d
<chr> <chr> <dbl>
1 New York New York 0
2 NY Ae 2
3 NY Al 2
4 NY As 2
5 NY As 2
6 NY As 2
7 NY Au 2
8 NY Ba 2
9 NY Bo 2
10 NY Bo 2
# ... with 40 more rows
The problem is that I have no idea how to use the distance between `name` and `City` to change the misspelled values into the right ones for all cities. In theory the correct value must be the closest one, but e.g. for NY this is not the case.
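One way to put the distance column to work (a sketch under assumptions, not a full solution): cap the allowed distance in the join, keep only the single closest match per input, and fall back to the original spelling when nothing qualifies. The ignore_case and max_dist values here are my guesses:
library(dplyr)
library(maps)
library(fuzzyjoin)
vec %>%
  stringdist_left_join(world.cities, by = c(City = "name"),
                       ignore_case = TRUE, max_dist = 2, distance_col = "d") %>%
  group_by(City) %>%
  slice_min(d, n = 1, with_ties = FALSE) %>%         # keep only the closest match
  ungroup() %>%
  transmute(City, corrected = coalesce(name, City))  # fall back to the input
This still cannot rescue abbreviations such as "NY": at distance 2 they tie with many unrelated two-letter names, so a small hand-made alias table (NY -> New York, etc.) applied before the join remains the realistic fix.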

Related

Convert list from API to dataframe

I have data as a list which looks like the following.
name type value
Api_collect list [5] List of length 5
country character [1] US
state character [1] Texas
computer character [1] Mac
house character [1] Mansion
president character [1] Trump
I have run the following code in R:
api_col <- base::rawToChar((response$country))
as.data.frame(api_json$country)
This results in the following df:
country
US
How do I transfer this list to a dataframe with every column of Api_collect except house?
Here's an option using purrr::map_df() and dplyr::select():
# name type value
# Api_collect list [5] List of length 5
# country character [1] US
# state character [1] Texas
# computer character [1] Mac
# house character [1] Mansion
# president character [1] Trump
library(dplyr)
library(purrr)
your_list <- list(
  country = "US",
  state = "Texas",
  computer = "Mac",
  house = "Mansion",
  president = "Biden"
)
purrr::map_df(your_list, ~.x) %>% select(-country)
Which gives:
# A tibble: 1 × 4
state computer house president
<chr> <chr> <chr> <chr>
1 Texas Mac Mansion Biden
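Note that the question asked to drop house rather than country; the same pattern works with a different column in select():
purrr::map_df(your_list, ~.x) %>% select(-house)
# A tibble: 1 × 4
#   country state computer president
#   <chr>   <chr> <chr>    <chr>
# 1 US      Texas Mac      Biden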

How do I rename the values in my column as I have misspelt them and can't rename them in R or Colab

I have a data frame that was given to me. Under the column titled state, there are two values with the same name but different capitalisation, i.e. one is "London" and the other is "LONDON". How would I be able to rename "LONDON" to "London" in order to total them up together and not separately? Reminder: I am trying to change the name of the value, not the name of the column.
You can use the following code; df is your current dataframe, in which you want to replace "LONDON" with "London":
df <- data.frame(Country = c("US", "UK", "Germany", "Brazil", "US", "Brazil", "UK", "Germany"),
                 State = c("NY", "London", "Bavaria", "SP", "CA", "RJ", "LONDON", "Berlin"),
                 Candidate = 1:8)
print(df)
output
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK LONDON 7
8 Germany Berlin 8
Then run the following code to substitute "London" in all instances where State equals "LONDON":
df[df$State == "LONDON", "State"] <- "London"
Now the output will be:
Country State Candidate
1 US NY 1
2 UK London 2
3 Germany Bavaria 3
4 Brazil SP 4
5 US CA 5
6 Brazil RJ 6
7 UK London 7
8 Germany Berlin 8
Maybe you could try using the case_when function. I would do something like this (note the fallback must be NA_character_, not NA_real_, because the other branches return character values):
mutate(data, State_def = case_when(State == "LONDON" ~ "London",
                                   State == "London" ~ "London",
                                   TRUE ~ NA_character_))
I might misunderstand, but I think it should be as simple as this:
x$state <- sub("LONDON", "London", x$state, fixed = TRUE)
This should change "LONDON" to "London".
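If more case variants than these two can occur ("london", "LoNdOn", ...), a case-insensitive comparison normalises them all in one pass; this generalisation is my addition, not part of the answers above:
# Normalise every case variant of "london" in one pass
df$State[tolower(df$State) == "london"] <- "London"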

Need to ID states from mixed names /IDs in location data

Need to ID states from mixed location data.
I need to search for the 50 state abbreviations and the 50 full state names, and return the state abbreviation.
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL", "Houston, TX",
         "Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF")
df <- data.frame(N, Loc)
# Objective: create a variable state such that
# state contains abbreviated names of states from Loc:
# for "Los Angeles, CA", state = CA
# for "Florida, USA", state = FL
# for "WV NY NJ", state = NA
# for "qwerty NJuy PO DOPL JKF", state = NA (in spite of containing the string NJ, it is not wrapped in spaces)
# End result should be Newdf
State <- c("CA", "NY", "FL", "IL", "TX","TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
> Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package, or can a loop be written? Even if the scheme could be demonstrated with a few states, that would be sufficient; I will post the full solution when I get to it. Btw, this is for a Twitter dataset downloaded using the rtweet package, and the variable is place_full_name.
There are built-in constants in R, state.abb and state.name, which can be used.
vars <- stringr::str_extract(df$Loc, paste0('\\b', c(state.abb, state.name),
                                            '\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do:
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, we can see that in the 9th row you expect NA as output, but here it returns "WV" because "WV" is a valid state abbreviation. In such cases you need to prepare rules which are strict enough that they extract only genuine state references and nothing else.
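For example, one stricter rule (my suggestion, not part of the answer above) is to accept a two-letter abbreviation only when it ends the string and follows a comma, which drops the bare "WV" in row 9:
# "Los Angeles, CA" -> "CA"; "WV NY NJ" -> NA; "Florida, USA" -> NA
strict <- stringr::str_extract(df$Loc, '(?<=, )[A-Z]{2}$')
strict
# [1] "CA" "NY" NA "IL" "TX" NA "CA" NA NA NA
The full-name rows ("Florida, USA", "Texas, USA") come out NA here and can then be filled in by the state.name matching step shown above.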
Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
Now, if an extracted abbreviation is not among the built-in ones, we can use match to find the position of the extracted state name in the built-in state.name vector and use that to index state.abb; otherwise we keep what we already have. Entries that match neither return NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
                       state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>

How to unite a lot of data frames via a data frames' names in a list

I have these two data frames (in real life I have 680):
DF1 <- data.frame(City = c("New York", "New York", "New York", "New York", "New York"),
                  Income = c("1", "2", "3", "4", "5"),
                  Amount = c(0, 13291.23678, 0, 0, 0))
DF2 <- data.frame(City = c("Dallas", "Dallas", "Dallas", "Dallas", "Dallas"),
                  Income = c("1", "2", "3", "4", "5"),
                  Amount = c(0, 65666.2885, 106896.3682, 69949.63342, 35549.94405))
I must combine those two data frames by rows, but I have the data frames' names in a list.
My expected outcome is to use that list to combine those data frames, something like this (see the sketch after the expected output):
DF <- data.frame(rbind(DFNames))
and generate this data frame:
City Income Amount
New York 1 0.00
New York 2 13291.24
New York 3 0.00
New York 4 0.00
New York 5 0.00
Dallas 1 0.00
Dallas 2 65666.29
Dallas 3 106896.37
Dallas 4 69949.63
Dallas 5 35549.94
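One possible approach (my sketch, not from the thread), assuming the list holds the object names as character strings: mget() fetches the data frames by name from the environment, and a single rbind() stacks them by rows.
DFNames <- c("DF1", "DF2")   # assumed: names stored as character strings
DF <- do.call(rbind, mget(DFNames))
rownames(DF) <- NULL
DF
With 680 data frames, dplyr::bind_rows(mget(DFNames)) does the same job and is usually faster than repeated rbind().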

Find levels of a factors that appear more than once

I have this dataframe:
data <- data.frame(countries = c(rep('UK', 5),
                                 rep('Netherlands 1a', 5),
                                 rep('Netherlands', 5),
                                 rep('USA', 5),
                                 rep('spain', 5),
                                 rep('Spain', 5),
                                 rep('Spain 1a', 5),
                                 rep('spain 1a', 5)),
                   var = rnorm(40))
countries var
1 UK 0.506232270
2 UK 0.976348808
3 UK -0.752151769
4 UK 1.137267199
5 UK -0.363406715
6 Netherlands 1a -0.800835463
7 Netherlands 1a 1.767724231
8 Netherlands 1a 0.810757929
9 Netherlands 1a -1.188975114
10 Netherlands 1a -0.763144245
11 Netherlands 0.428511920
12 Netherlands 0.835184425
13 Netherlands -0.198316780
14 Netherlands 1.108191193
15 Netherlands 0.946819500
16 USA 0.226786121
17 USA -0.466886468
18 USA -2.217910876
19 USA -0.003472937
20 USA -0.784264921
21 spain -1.418014562
22 spain 1.002412706
23 spain 0.472621627
24 spain -1.378960222
25 spain -0.197020702
26 Spain 1.197971896
27 Spain 1.227648883
28 Spain -0.253083684
29 Spain -0.076562960
30 Spain 0.338882352
31 Spain 1a 0.074459521
32 Spain 1a -1.136391220
33 Spain 1a -1.648418916
34 Spain 1a 0.277264011
35 Spain 1a -0.568411569
36 spain 1a 0.250151646
37 spain 1a -1.527885883
38 spain 1a -0.452190849
39 spain 1a 0.454168927
40 spain 1a 0.889401396
I want to be able to find levels of countries that appear in different forms more than once. Forms that levels of countries might appear in are:
lowercase, for example "spain"
titlecase, for example "Spain"
lowercase with a different word attached, for example "spain 1a"
titlecase with a different word attached, for example "Spain 1a"
So I need a function that returns a vector listing the levels of countries that appear more than once. In data, the vector that should be returned is:
"Netherlands 1a", "Netherlands", "spain", "Spain", "spain 1a", "Spain 1a"
Is it possible to make a function that would return this vector?
A quick solution that should meet all requirements (assuming that the country name is always the first element of your data$countries entries):
# Country substrings
country.substr <- sapply(strsplit(tolower(levels(data$countries)), " "), "[[", 1)
# Duplicated country substrings
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a"
Update:
Assuming that the country name is not always to be found at the first position, you need to apply a different approach that I took from here. Note that I slightly modified your sample data to clarify what I'm doing:
data <- data.frame(countries = c(rep('United Kingdom', 5),
                                 rep('united kingdom', 5),
                                 rep('Netherlands', 5),
                                 rep('Netherlands 1a', 5),
                                 rep('1a Netherlands', 5),
                                 rep('USA', 5),
                                 rep('spain', 5),
                                 rep('Spain', 5),
                                 rep('Spain 1a', 5),
                                 rep('spain 1a', 5)),
                   var = rnorm(50))
Now let's identify all country substrings that do NOT contain any numerics. The subsequent steps remain the same. Is that what you need?
# Remove mixed numeric/alphabetic parts from country names
country.substr <- lapply(strsplit(tolower(levels(data$countries)), " "), function(i) {
  # Identify, paste and return alphabetic-only components
  tmp <- grep("^[[:alpha:]]*$", i)
  if (length(tmp) == 1)
    return(i[tmp])
  else
    return(paste(i[tmp], collapse = " "))
})
# Identify duplicated country names
country.substr.dupl <- duplicated(country.substr)
# Display all country levels that appear in different forms
do.call("c", lapply(unique(country.substr[country.substr.dupl]), function(i) {
levels(data$countries)[grep(i, tolower(levels(data$countries)))]
}))
[1] "1a Netherlands" "Netherlands" "Netherlands 1a" "spain" "Spain" "spain 1a" "Spain 1a" "united kingdom" "United Kingdom"
Why not use grep? The ignore.case argument is just what you need here.
> uch <- unique(as.character(data$countries))
> found <- sapply(seq(uch), function(i){
    if(!grepl("\\s|[0-9]", uch[i]))
      grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
  })
> ff <- found[sapply(found, function(x) length(x) > 1)]
> unique(unlist(ff))
# [1] "Netherlands 1a" "Netherlands" "spain"
# [4] "Spain" "Spain 1a" "spain 1a"
Here's my logic: take the unique factor levels of the column as a character vector, then compare it with itself, looking only at those levels that do not contain a space or a digit. grep will catch those, but the other way around is a bit tougher. Then we just keep the unique matches. So here's a function and a test run:
find.matches <- function(column)
{
  uch <- unique(as.character(column))
  found <- sapply(seq(uch), function(i){
    if(!grepl("\\s|[0-9]", uch[i]))
      grep(uch[i], uch, ignore.case = TRUE, value = TRUE)
  })
  ff <- found[sapply(found, function(x) length(x) > 1)]
  unique(unlist(ff))
}
> dat <- data.frame(x = c("a", "a1", "a 1b", "c", "d"),
                    y = c("fac", "tor", "fac 1a", "tor1a", "fac"))
> sapply(dat, find.matches)
# $x
# [1] "a" "a1" "a 1b"
#
# $y
# [1] "fac" "fac 1a" "tor" "tor1a"
