Convert list from API to dataframe - r

I have data as a list, which looks like the following.
name type value
Api_collect list [5] List of length 5
country character [1] US
state character [1] Texas
computer character [1] Mac
house character [1] Mansion
president character [1] Trump
I ran the following code in R:
api_col <- base::rawToChar((response$country))
as.data.frame(api_json$country)
And results in this df:
country
US
How can I convert this list to a data frame with every column of Api_collect except for house?

Here's an option using purrr::map_df() and dplyr::select():
# name type value
# Api_collect list [5] List of length 5
# country character [1] US
# state character [1] Texas
# computer character [1] Mac
# house character [1] Mansion
# president character [1] Trump
library(dplyr)
library(purrr)
your_list <- list(
country = "US",
state = "Texas",
computer = "Mac",
house = "Mansion",
president = "Biden"
)
purrr::map_df(your_list, ~.x) %>% select(-house)
Which gives:
# A tibble: 1 × 4
country state computer president
<chr> <chr> <chr> <chr>
1 US Texas Mac Biden
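For comparison, here is a minimal base R sketch of the same idea with no extra packages. It assumes, as in the example above, that every list element has length 1; the name keep is only illustrative.

```r
your_list <- list(
  country = "US",
  state = "Texas",
  computer = "Mac",
  house = "Mansion",
  president = "Biden"
)

# Keep every element except "house", then turn the list into a one-row data frame
keep <- setdiff(names(your_list), "house")
df <- as.data.frame(your_list[keep], stringsAsFactors = FALSE)
df
```

Subsetting the list before the conversion means the unwanted column is never created.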

Related

How can I parse a variable into multiple columns according to multiple conditions in R?

I'm new to R, so please bear with me. I am looking at incarceration data, and have a variable conviction, which is a messy string that looks like this:
[1] "Ct. 1: Conspiracy to distribute"
[2] "Aggravated Assault"
[3] "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"
[4] "Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling"
Ideally, I want to do two things. First, I want to parse on Ct. into multiple columns. For the first three rows, the data would look like this:
convictions conviction_1 conviction_2
[1,] "Ct. 1: Conspiracy to distribute" "Conspiracy to distribute" NA
[2,] "Aggravated Assault" "Aggravated Assault" NA
[3,] "Ct. 1: Possession of prohibited object" "Possession of prohibited object" "criminal forfeiture"
but things get hairy when I get to the fourth row, because I would want to parse the first part of the string (Ct. 1-6: Human Trafficking) into 6 columns, and then Cts. 7, 8 Unlawful contact into 2 more columns.
The second part is that I then want to generate a variable convictions_total which would find the highest number in the conviction string that follows Ct. For the three example entries I included here, convictions_total would look like:
[1] 1 2 36
This is the code I used to parse a much more straight-forward string variable, but I'm unsure how to tweak it for this variable:
cols <- data.frame(str_split_fixed(data$convictions, ",", Inf))
colnames(cols) <- paste0("conviction_",rep(1:length(cols)))
data <- cbind(data,cols)
Thank you in advance!
The following works for your example without using much regular expression, mostly just digit extraction or other string detection:
library(stringr)
library(magrittr)
library(purrr)
library(plyr)
convictions_total <- sapply(stringr::str_extract_all(convictions, "\\d+"),
function(x) max(as.numeric(x), 1))
convictions_split <- strsplit(convictions, ";")
reps <- lapply(convictions_split, FUN = function(x) {
sapply(x, FUN = function(i) {
num <- paste(stringr::str_extract_all(i, "[\\d+\\-,]")[[1]], collapse = "")
# "-" indicates a range: take largest value
if (stringr::str_detect(num, "-")){
stringr::str_extract_all(num, "\\d+") %>%
unlist() %>%
as.numeric() %>%
max() %>%
return()
# "," indicates a sequence: get length of sequence
} else if(stringr::str_detect(num, ",")){
# return the sum directly: %>% binds tighter than +, so piping "+ 1" would not pipe the sum
return(stringr::str_count(num, ",") + 1)
# otherwise return 1
} else {
return(1)
}
})
})
convictions_str <- lapply(convictions_split,
function(x) gsub(".*\\d:?\\s(.*)$", "\\1", x))
df <- purrr::map2(convictions_str, reps, rep) %>%
plyr::ldply(rbind) %>%
cbind(convictions_total, .) %>%
data.frame() %>%
dplyr::rename_with(~ gsub("X", "conviction_", .x), dplyr::starts_with("X"))
Output
convictions_total conviction_1 conviction_2 conviction_3
1 1 Conspiracy to distribute <NA> <NA>
2 1 Aggravated Assault <NA> <NA>
3 2 Possession of prohibited object criminal forfeiture <NA>
4 36 Human Trafficking Human Trafficking Human Trafficking
conviction_4 conviction_5 conviction_6 conviction_7 conviction_8
1 <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA>
4 Human Trafficking Human Trafficking Human Trafficking Unlawful contact Unlawful contact
conviction_9 conviction_10
1 <NA> <NA>
2 <NA> <NA>
3 <NA> <NA>
4 Involuntary Servitude Smuggling
Data
convictions <- c("Ct. 1: Conspiracy to distribute",
"Aggravated Assault",
"Ct. 1: Possession of prohibited object; Ct.: 2 criminal forfeiture",
"Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling")
How it works
convictions_total is simple to extract by using stringr::str_extract_all to take all the numbers from each row in convictions. This returns a list of vectors. sapply then takes the maximum from each vector in the list and returns a vector.
reps is a list where the elements correspond to the elements of convictions and it stores a numerical vector of how many times to repeat each conviction count.
The code works by first splitting convictions into a list of vectors where the vectors contain the following extracted information: digits (\\d+), dashes (\\-), and commas (,). The logic works by searching these string extractions:
First, if it finds a "-" in the conviction count then that indicates a range and again it takes the largest value. For example "Ct. 1-6: Human Trafficking" will return 6.
Next, if it does not find a "-" but instead finds a ",", that denotes a counting delimiter. So it counts the number of comma delimiters and adds one. For example "Cts. 7, 8 Unlawful contact" will return 2.
Everything else is assumed to be repeated only once since it's not a sequential list or a range.
reps
[[1]]
Ct. 1: Conspiracy to distribute
1
[[2]]
Aggravated Assault
1
[[3]]
Ct. 1: Possession of prohibited object Ct.: 2 criminal forfeiture
1 1
[[4]]
Ct. 1-6: Human Trafficking Cts. 7, 8 Unlawful contact Ct. 11: Involuntary Servitude
6 2 1
Ct. 36: Smuggling
1
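The three rules can also be tried in isolation. The following is a base R sketch, not the answer's code: count_reps is a hypothetical helper name, and it inherits the same assumption that the only digits, dashes and commas in a piece belong to the count label.

```r
# Hypothetical helper: how many times should one conviction be repeated?
count_reps <- function(count_label) {
  # Keep only digits, commas and dashes, e.g. "Ct. 1-6: Human Trafficking" -> "1-6"
  num <- gsub("[^0-9,-]", "", count_label)
  if (grepl("-", num)) {
    # A range: take the largest number
    max(as.numeric(strsplit(num, "[^0-9]+")[[1]]), na.rm = TRUE)
  } else if (grepl(",", num)) {
    # A sequence: number of commas plus one
    nchar(gsub("[^,]", "", num)) + 1
  } else {
    # A single count, or no count label at all
    1
  }
}

pieces <- c("Ct. 1-6: Human Trafficking", " Cts. 7, 8 Unlawful contact",
            " Ct. 11: Involuntary Servitude", "Aggravated Assault")
reps_example <- unname(vapply(pieces, count_reps, numeric(1)))
reps_example
```

Note that a conviction text containing its own comma or hyphen would be miscounted by this sketch, just as it would by the character-class extraction above.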
convictions_str just extracts the actual conviction information. For example, from "Ct. 1: Conspiracy to distribute" the code will extract "Conspiracy to distribute" and so on for all the convictions.
[[1]]
[1] "Conspiracy to distribute"
[[2]]
[1] "Aggravated Assault"
[[3]]
[1] "Possession of prohibited object" "criminal forfeiture"
[[4]]
[1] "Human Trafficking" "Unlawful contact" "Involuntary Servitude"
[4] "Smuggling"
At this point reps and convictions_str have a related structure:
convictions_str[[1]][1] should be repeated reps[[1]][1] times
convictions_str[[1]][2] should be repeated reps[[1]][2] times
purrr::map2 takes advantage of this structure using the rep function to repeat the elements in convictions_str by the values stored in reps and outputs a list. plyr::ldply row binds this list filling with NA since not everyone has the same number of convictions. cbind adds the column convictions_total, and dplyr::rename_with changes the column names.
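To make the repeat-and-pad step concrete, here is a small base R sketch that stands in for purrr::map2 and plyr::ldply. The two input lists are shortened, made-up stand-ins for convictions_str and reps, and the padding with NA is written out by hand.

```r
# Shortened stand-ins for convictions_str and reps
convictions_str <- list("Aggravated Assault",
                        c("Human Trafficking", "Smuggling"))
reps <- list(1, c(3, 1))

# Repeat each conviction by its count (what map2(..., rep) does)
expanded <- Map(rep, convictions_str, reps)

# Pad every row with NA to the longest length, then bind the rows
width <- max(lengths(expanded))
padded <- t(sapply(expanded, function(x) {
  c(x, rep(NA_character_, width - length(x)))
}))
colnames(padded) <- paste0("conviction_", seq_len(width))
padded
```

The result is a character matrix with one row per person and one column per repeated conviction, which is the shape the data frame above has before convictions_total is bound on.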
After going down a two-day rabbit hole, I figured out a tidy version of @LMc's code, which ended up working better because calling plyr was messing up other code I had written:
library(tidyverse)
test_data <-
tibble(id = 1:5,
convictions = c("Ct. 1: Conspiracy to distribute" ,
"Aggravated Assault" ,
"Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture" ,
"Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling 50 grams",
"Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28: Money Laundering"))
test_data <- test_data %>%
mutate(c2 = convictions) #this just duplicates the original variable convictions because I want to preserve it
test_data <- test_data %>%
separate_rows(c2, sep = ";") %>%
mutate(c2 = str_remove(c2, "Ct(s)?(\\. )(\\d|-|:|,|\\s)+")) %>%
group_by(id) %>%
mutate(conviction_number = paste0("c_", row_number())) %>%
pivot_wider(values_from = c2, names_from = conviction_number)
test_data <- test_data %>%
mutate(c2 = convictions) #again, just preserving the original variable
test_data <- test_data %>%
separate_rows(c2, sep = ";") %>%
mutate(total_counts = as.numeric(ifelse(is.na(str_extract(c2, "((?<=\\-)\\d+)")), str_extract(c2, "\\d+"), str_extract(c2, "((?<=\\-)\\d+)")))) %>%
mutate(total_counts = ifelse(is.na(total_counts), 1, total_counts)) %>%
group_by(id) %>%
slice_max(total_counts)
which produces the following dataframe:
id convictions c_1 c_2 c_3 c_4 c2 total_counts
<int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 Ct. 1: Conspiracy to distribute Conspiracy to dis~ NA NA NA "Ct. 1: Conspirac~ 1
2 2 Aggravated Assault Aggravated Assault NA NA NA "Aggravated Assau~ 1
3 3 Ct. 1: Possession of prohibited object; Ct. 2: criminal for~ Possession of pro~ " criminal f~ NA NA " Ct. 2: criminal~ 2
4 4 Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct.~ Human Trafficking " Unlawful c~ " Involuntary~ " Smuggling~ " Ct. 36: Smuggli~ 36
5 5 Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28: Money ~ Conspiracy " Wire Fraud" " Money Laund~ NA " Cts. 8-28: Mon~ 28
The first chunk of code parses the counts into separate rows, and then pivots back to the c_ columns. The second code chunk does the same parsing, but then looks across each entry to parse out the digits, instead of the words.
\\d+ looks for the first run of digits, but it turns out I had data that looked like Cts. 2-7 where I wanted the value 7, and not 2.
((?<=\\-)\\d+) looks for the hyphen, and then parses the digits after it. If there is no hyphen, the code falls back to \\d+.
Finally, slice_max collapses the data down to 1 entry per ID based on the highest value of total_counts.
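The hyphen-then-fallback logic can be checked on its own. This is a base R sketch using regexpr with perl = TRUE in place of stringr::str_extract; extract_first is a hypothetical helper written only for this illustration.

```r
# Hypothetical helper: first match of a pattern per string, NA when there is none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[as.vector(m) > 0] <- regmatches(x, m)
  out
}

counts <- c("Ct. 1: Conspiracy", " Cts. 2-7: Wire Fraud",
            " Cts. 8-28: Money Laundering", "Aggravated Assault")

after_hyphen <- extract_first(counts, "(?<=-)\\d+")  # digits right after a hyphen
first_number <- extract_first(counts, "\\d+")        # otherwise the first number

total_counts <- as.numeric(ifelse(is.na(after_hyphen), first_number, after_hyphen))
total_counts[is.na(total_counts)] <- 1               # no number at all counts as 1
total_counts
```

Because the fallback takes the first number in the piece, a trailing quantity such as "50 grams" after the count label does not affect the result.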

R : Parsing and storing hierarchical structured data separated by space along with quick search and fast access

I have a large data set in which each line holds a field name and its value, separated by a space.
These fields are combined to make a single record, and each record can have a variable number of child records, indented with a tab.
The content of the file looks something like this:
company Samsung
type private
based South Korea
company Harman International
type private
based United States
industry Electronics
company JBL
type subsidiary
based United States
industry Audio
company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
I want to store these records while maintaining the hierarchical structure and with an option to do quick search and way to access every record.
So far I came up with this approach :
# reading file from the source
path <- "/path/to/file.txt"
content <- readLines(path, warn = F)
# replaces , with ; so it does not translate it as a separator in next step
content <- gsub(",", ";", content)
# creating list of fields and value
contentList <- read.csv(text=sub(" ", ",", content), header=FALSE)
# replacing ; with , to revert data in right format
contentList$V2 <- gsub(";", ",", contentList$V2)
After the above step, contentList is a two-column data frame, with the field names (including any leading tabs) in V1 and the values in V2.
In the next step, I thought of using a function that would create a list with these rules:
if the field does not have any \t, add it to the list (as a named vector)
if the field has one or more \t, make it a sub-list (as a named vector) of the previous record
But I don't know how this could be implemented in R.
How should I implement this?
Or is there a better way to solve this problem that allows searching and accessing values quickly?
Using content from the Note at the end, count the spaces at the beginning of each company line and use gsubfn to replace them with a level number giving L2. Then after trimming away leading spaces replace the first space on each line with a colon giving L3. The file is now in dcf format so read it in using read.dcf giving L4.
Now generate a lv variable giving the level number as a number and generate sequential numeric ids for each row. Compute the parent id giving parent and then construct a data frame with what we have computed so far. The overall root of the tree will be denoted by 0. From DF generate an edgelist, e, for the graph and convert that to an igraph. From that generate the simple paths and create a data frame DF2 having columns paths, company, type, based and industry such that each row represents one node other than the root.
If you wish you can add lv and parent to the data frame which we computed but did not add since you may not need those.
The assumption below is that each indent is 4 spaces.
There is no restriction on how deep the levels can go.
We can search DF2 using data frame operations for various text based queries such as
subset(DF2, grepl("Samsung", paths)) # Samsung and its descendents
or we can use igraph functions for graph queries on g such as
max(length(get.diameter(g))) - 1 # max depth not counting root
or we can use data.tree functions for queries
dt$height - 1 # max depth not counting root
Code
The code follows.
library(gsubfn)
content <- readLines(textConnection(Lines))
L2 <- gsubfn("( *)company", ~ paste0("level ", nchar(x) / 4L + 1L, "\ncompany"), content)
L3 <- sub(" ", ":", trimws(readLines(textConnection(L2))))
L4 <- read.dcf(textConnection(L3))
lv <- as.numeric(L4[, 1])
id <- seq_along(lv)
company <- L4[, "company"]
parent <- sapply(id, function(i) c(tail(which(lv[1:i] < lv[i]), 1), 0)[1])
DF <- data.frame(id = company[id], parent = c("0", company)[parent+1],
level = lv, L4[, -1], stringsAsFactors = FALSE)
e <- with(DF, cbind(parent, id))
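The parent lookup is the least obvious line above, so here it is run on a small made-up level vector. For each row it takes the nearest earlier row with a smaller level, or 0 (the root) when there is none.

```r
# Made-up levels: a root-level node, its child, a grandchild,
# a second child of the first node, and another root-level node
lv <- c(1, 2, 3, 2, 1)
id <- seq_along(lv)

# Same expression as in the code above
parent <- sapply(id, function(i) c(tail(which(lv[1:i] < lv[i]), 1), 0)[1])
parent
```

Row 4 is at level 2, so its parent is row 1 rather than row 3, which shows that the lookup handles returning to a shallower level.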
igraph
Now that we have an edge list we can create an igraph and process it using that package.
library(igraph)
g <- graph_from_edgelist(e)
p <- all_simple_paths(g, "0")
paths <- sapply(p, function(x) paste(names(x), collapse = "/"))
DF2 <- data.frame(paths, L4[, -1], stringsAsFactors = FALSE)
DF2
giving a paths column followed by the attributes of each node:
paths company type based industry
1 0/Samsung Samsung private South Korea <NA>
2 0/Samsung/Harman International Harman International private United States Electronics
3 0/Samsung/Harman International/JBL JBL subsidiary United States Audio
4 0/Amazaon Amazaon public United States Cloud computing, e-commerce, artificial intelligence, consumer electronics
We can plot the graph like this:
plot(g, layout = layout_as_tree(g))
data.tree
We could also use data.tree and its many functions to process this:
library(data.tree)
library(DiagrammeR)
dt <- FromDataFrameNetwork(DF)
print(dt, "type", "based", "industry")
giving:
levelName type based industry
1 0
2 ¦--Samsung private South Korea
3 ¦ °--Harman International private United States Electronics
4 ¦ °--JBL subsidiary United States Audio
5 °--Amazaon public United States Cloud computing, e-commerce, artificial intelligence, consumer electronics
We can plot or convert the data tree data as follows
plot(dt) # plot in browser
ToListSimple(dt) # convert to nested list
ToListExplicit(dt) # similar but children in children component
Note
We can create content reproducibly like this:
Lines <- "
company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics"
content <- readLines(textConnection(Lines))
RAW DATA IN
library(tidyverse)
raw <- read_lines("company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics")
PUT IN A TIBBLE AND GET THE INDENTUREMENT
library(tidyverse)
rawDf <- tibble(RAW = raw)
companyIndenture <- rawDf %>%
filter(str_detect(RAW, "^\\s*company")) %>%
mutate(LVL = case_when(
str_detect(RAW, "^\\s{8}") ~ 3,
str_detect(RAW, "^\\s{4}") ~ 2,
TRUE ~ 1),
COMPANY = str_replace(RAW, "^\\s*company\\s", "")) %>%
select(-RAW)
# Gives us
# A tibble: 4 x 2
# LVL COMPANY
# <dbl> <chr>
# 1 1 Samsung
# 2 2 Harman International
# 3 3 JBL
# 4 1 Amazaon
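The case_when above hard-codes three levels. As a sketch of a version without that limit, the level can be computed from the number of leading spaces, assuming each indent is exactly 4 spaces. This uses base R only, and the input is a shortened, made-up version of raw built with strrep so the indentation is explicit.

```r
# Shortened stand-in for raw; strrep() makes the indentation explicit
raw <- c("company Samsung",
         "type private",
         paste0(strrep(" ", 4), "company Harman International"),
         paste0(strrep(" ", 8), "company JBL"),
         "company Amazaon")

is_company <- grepl("^\\s*company\\s", raw)

# Leading spaces = full length minus length after stripping them
indent <- nchar(raw) - nchar(sub("^ +", "", raw))

# Assumption: 4 spaces per level
lvl <- indent[is_company] / 4 + 1
company <- sub("^\\s*company\\s+", "", raw[is_company])
data.frame(LVL = lvl, COMPANY = company)
```

If the file is indented with tabs instead, the same idea applies with the tab count in place of spaces divided by 4.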
CLEAN WHITESPACE
Now that we know what LVL each company is, let's get rid of some whitespace
nextly <- rawDf %>%
mutate(RAW = str_replace(RAW, "^\\s*", "")) %>%
filter(RAW != "") %>%
separate(RAW, c("ATTR", "VALUE"), sep = " ", extra = "merge") %>%
# And bring the LVL back in
left_join(companyIndenture, by = c("VALUE" = "COMPANY")) %>%
select(LVL, ATTR, VALUE)
# A tibble: 15 x 3
# LVL ATTR VALUE
# <dbl> <chr> <chr>
# 1 1 company Samsung
# 2 NA type private
# 3 NA based South Korea
# 4 2 company Harman International
# 5 NA type private
# 6 NA based United States
# 7 NA industry Electronics
# 8 3 company JBL
# 9 NA type subsidiary
# 10 NA based United States
# 11 NA industry Audio
# 12 1 company Amazaon
# 13 NA type public
# 14 NA based United States
# 15 NA industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
DISTRIBUTE THE HIERARCHY
Each company gets a LVL.1, LVL.2, LVL.3 structure. The "" values make it work out right when we fill().
further <- nextly %>%
mutate(LVL.1 = ifelse(LVL == 1, VALUE, NA_character_),
LVL.2 = case_when(LVL == 1 ~ "",
LVL == 2 ~ VALUE,
TRUE ~ NA_character_),
LVL.3 = ifelse(LVL == 3, VALUE, "")) %>%
fill(starts_with("LVL.")) %>%
filter(ATTR != "company") %>%
select(LVL.1, LVL.2, LVL.3, ATTR, VALUE)
# A tibble: 11 x 5
# LVL.1 LVL.2 LVL.3 ATTR VALUE
# <chr> <chr> <chr> <chr> <chr>
# 1 Samsung "" "" type private
# 2 Samsung "" "" based South Korea
# 3 Samsung Harman International "" type private
# 4 Samsung Harman International "" based United States
# 5 Samsung Harman International "" industry Electronics
# 6 Samsung Harman International JBL type subsidiary
# 7 Samsung Harman International JBL based United States
# 8 Samsung Harman International JBL industry Audio
# 9 Amazaon "" "" type public
# 10 Amazaon "" "" based United States
# 11 Amazaon "" "" industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
HANDLE AMAZAON'S MULTIPLE INDUSTRIES
Finally, let's str_split and unnest those 'industry' values for Amazaon.
finally <- further %>%
mutate(VALUE = str_split(VALUE, ",\\s*")) %>%
unnest(VALUE)
# A tibble: 14 x 5
# LVL.1 LVL.2 LVL.3 ATTR VALUE
# <chr> <chr> <chr> <chr> <chr>
# 1 Samsung "" "" type private
# 2 Samsung "" "" based South Korea
# 3 Samsung Harman International "" type private
# 4 Samsung Harman International "" based United States
# 5 Samsung Harman International "" industry Electronics
# 6 Samsung Harman International JBL type subsidiary
# 7 Samsung Harman International JBL based United States
# 8 Samsung Harman International JBL industry Audio
# 9 Amazaon "" "" type public
# 10 Amazaon "" "" based United States
# 11 Amazaon "" "" industry Cloud computing
# 12 Amazaon "" "" industry e-commerce
# 13 Amazaon "" "" industry artificial intelligence
# 14 Amazaon "" "" industry consumer electronics
Q.E.D.
LAGNAPPE
further %>%
spread(key = "ATTR", value = "VALUE") %>%
mutate(industry = str_split(industry, ",\\s*")) %>%
unnest(industry)
# A tibble: 7 x 6
LVL.1 LVL.2 LVL.3 based type industry
<chr> <chr> <chr> <chr> <chr> <chr>
1 Amazaon "" "" United States public Cloud computing
2 Amazaon "" "" United States public e-commerce
3 Amazaon "" "" United States public artificial intelligence
4 Amazaon "" "" United States public consumer electronics
5 Samsung "" "" South Korea private NA
6 Samsung Harman International "" United States private Electronics
7 Samsung Harman International JBL United States subsidiary Audio

How to count multiple text values in a column in R?

I have a dataframe with a column of city names, in each cell of this column there are multiple text values separated by ",".
For example the first 4 rows of the cities column of my df are:
"Barcelona, Milaan, Londen, Paris, Berlin"
"Barcelona"
"Milaan, Barcelona, Berlin"
"London, Berlin"
I want to count, for each row of this column, how many cities occur.
For example, the output needs to look like this:
count_cities
5
1
3
2
Thank you in advance!
DATA:
cities <- data.frame(names = c("Barcelona, Milaan, Londen, Paris, Berlin","Barcelona",
"Milaan, Barcelona, Berlin","London, Berlin"), stringsAsFactors = F)
To count how many city names there are, you can first split the string at "," and count the pieces using lengths:
cities$count <- lengths(strsplit(cities$names, ","))
The resulting dataframe is this:
cities
names count
1 Barcelona, Milaan, Londen, Paris, Berlin 5
2 Barcelona 1
3 Milaan, Barcelona, Berlin 3
4 London, Berlin 2
EDIT:
If the strings contain not only city names but additional information, you can use str_count to match words that start with an upper-case letter (because city names begin with an upper-case letter but other words don't, at least not in the example you've given):
cities <- data.frame(names = c("Barcelona, Milaan, Londen, Paris, Berlin","Barcelona (a big city)",
"Milaan, Barcelona, Berlin","London, Berlin (are all capitals, are big cities)"), stringsAsFactors = F)
library(stringr)
cities$count <- str_count(cities$names, "[A-Z][a-z]+")
Alternatively, use str_extract_all:
cities$count <- lengths(str_extract_all(cities$names, "[A-Z][a-z]+"))
library(tidyverse)
travel <- tibble(CITYS = c("Barcelona, Milaan, Londen, Paris, Berlin",
"Barcelona",
"Milaan, Barcelona, Berlin",
"London, Berlin"))
travel %>%
mutate(CITY.COUNT = map_dbl(str_split(CITYS, ",\\s*"), length))
Yields
# A tibble: 4 x 2
CITYS CITY.COUNT
<chr> <dbl>
1 Barcelona, Milaan, Londen, Paris, Berlin 5
2 Barcelona 1
3 Milaan, Barcelona, Berlin 3
4 London, Berlin 2
Another option is str_count
library(stringr)
str_count(travel$CITYS, "\\w+")
#[1] 5 1 3 2
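One caveat worth checking before relying on "\\w+": it counts words, not cities, so multi-word names are counted more than once. The sketch below uses made-up city strings and base R equivalents (gregexpr and strsplit) to compare the two approaches.

```r
# Made-up examples containing multi-word city names
cities <- c("New York, Berlin", "Barcelona", "San Sebastian, Paris, Rome")

# Word-based count (the same pattern as str_count(x, "\\w+"))
by_word <- lengths(regmatches(cities, gregexpr("\\w+", cities)))

# Separator-based count
by_comma <- lengths(strsplit(cities, ",\\s*"))

data.frame(cities, by_word, by_comma)
```

For data like the question's, where every city is a single word, both give the same answer; splitting on the separator is the safer choice otherwise.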

Replacing strings using fuzzywuzzyR

I have a large data set with city names. Many of the names are not consistent.
Example:
vec = c("New York", "New York City", "new York CIty", "NY", "Berlin", "BERLIn", "BERLIN", "London", "LONDEN", "Lond", "LONDON")
I want to use fuzzywuzzyR to bring them into a consistent format. The problem is that I have no master list of the original city names.
This package provides the possibility to detect duplicates like this:
library(fuzzywuzzyR)
init_proc = FuzzUtils$new()
PROC = init_proc$Full_process
init_scor = FuzzMatcher$new()
SCOR = init_scor$WRATIO
init = FuzzExtract$new()
init$Dedupe(contains_dupes = vec, threshold = 70L, scorer = SCOR)
dict_keys(['New York City', 'NY', 'BERLIN', 'LONDEN'])
Or I can set a "master value" like this:
master = "London"
init$Extract(string = master, sequence_strings = vec, processor = PROC, scorer = SCOR)
[[1]]
[[1]][[1]]
[1] "London"
[[1]][[2]]
[1] 100
[[2]]
[[2]][[1]]
[1] "LONDON"
[[2]][[2]]
[1] 100
[[3]]
[[3]][[1]]
[1] "Lond"
[[3]][[2]]
[1] 90
[[4]]
[[4]][[1]]
[1] "LONDEN"
[[4]][[2]]
[1] 83
[[5]]
[[5]][[1]]
[1] "NY"
[[5]][[2]]
[1] 45
My question is how I can use this to replace all matches in the list with the same value, i.e. I would like to replace all values that match the master value with "London". However, I don't have the master values, so I need to build a list of matches and replace the values with them. In this case the master values would be "New York", "London" and "Berlin". After the process, vec should look like this:
new_vec = c("New York", "New York", "New York", "New York", "Berlin", "Berlin", "Berlin", "London", "London", "London", "London")
Update
@camille came up with the idea of using world.cities of the maps package. I found this post using fuzzyjoin dealing with a similar problem.
To use this I convert vec to a data frame.
vec = as.data.frame(vec, stringsAsFactors = F)
colnames(vec) = c("City")
Then using the fuzzyjoin package together with world.cities of the maps package.
library(dplyr)
library(maps)
library(fuzzyjoin)
vec %>%
stringdist_left_join(world.cities, by = c(City = "name"), distance_col = "d") %>%
group_by(City) %>%
top_n(1)
The output looks like this:
# A tibble: 50 x 3
# Groups: City [5]
City name d
<chr> <chr> <dbl>
1 New York New York 0
2 NY Ae 2
3 NY Al 2
4 NY As 2
5 NY As 2
6 NY As 2
7 NY Au 2
8 NY Ba 2
9 NY Bo 2
10 NY Bo 2
# ... with 40 more rows
The problem is that I have no idea how to use the distance between name and City to change the misspelled values into the right ones for all cities. In theory the correct value should be the closest one, but for NY, for example, this is not the case.
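One way to turn a distance into a replacement is to pick, for each value, the reference name with the smallest edit distance. The sketch below is only an illustration under strong assumptions: reference is a hypothetical, hand-written master list (a filtered subset of world.cities could play that role), aliases is a hypothetical lookup for abbreviations, and utils::adist stands in for the fuzzywuzzyR and fuzzyjoin scorers.

```r
vec <- c("New York", "New York City", "new York CIty", "NY", "Berlin",
         "BERLIn", "BERLIN", "London", "LONDEN", "Lond", "LONDON")

reference <- c("New York", "Berlin", "London")  # hypothetical master list
aliases <- c(ny = "New York")                   # hypothetical abbreviation lookup

# Compare case-insensitively; adist() returns a values-by-references distance matrix
key <- tolower(vec)
d <- adist(key, tolower(reference))

# Closest reference name for every value
new_vec <- reference[apply(d, 1, which.min)]

# Abbreviations are too short for edit distance to be meaningful, so look them up
is_alias <- key %in% names(aliases)
new_vec[is_alias] <- unname(aliases[key[is_alias]])
new_vec
```

The explicit alias step reflects the problem described above: for a two-letter string like NY, the nearest name by edit distance is not necessarily the intended city.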

How to remove specific words in a column

I have a column consisting of several country offices associated with a company, where I would like to shorten e.g. China Country Office and Bangladesh Country Office to just China or Bangladesh. In other words, I want to remove the words "Office" and "Country" from the column called Imp_Office.
I tried using the tm package, with reference to an earlier post, but nothing happened.
What I wrote:
library(tm)
stopwords = c("Office", "Country","Regional")
MY_df$Imp_Office <- gsub(paste0(stopwords, collapse = "|","",
MY_df$Imp_Office))
Where I got the following error message:
Error in gsub(paste0(stopwords, collapse = "|", "", MY_df$Imp_Office))
:
argument "x" is missing, with no default
I also tried using the function readLines:
stopwords = readLines("Office", "Country","Regional")
MY_df$Imp_Office <- gsub(paste0(stopwords, collapse = "|","",
MY_df$Imp_Office))
But this didn't help either
I have considered the possibility of using some other string manipulation method, but I don't need to detect, replace or remove whitespace - so I am kind of lost here.
Thank you.
First, let's set up a dataframe with a column like what you describe:
library(tidyverse)
df <- tibble(Imp_Office = c("China Country Office",
"Bangladesh Country Office",
"China",
"Bangladesh"))
df
#> # A tibble: 4 x 1
#> Imp_Office
#> <chr>
#> 1 China Country Office
#> 2 Bangladesh Country Office
#> 3 China
#> 4 Bangladesh
Then we can use str_remove_all() from the stringr package to remove any bits of text that you don't want from them.
df %>%
mutate(Imp_Office = str_remove_all(Imp_Office, " Country| Office"))
#> # A tibble: 4 x 1
#> Imp_Office
#> <chr>
#> 1 China
#> 2 Bangladesh
#> 3 China
#> 4 Bangladesh
Created on 2018-04-24 by the reprex package (v0.2.0).
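As for the error in the question: the closing parenthesis of paste0() is misplaced, so gsub() receives a single argument and reports that x is missing. A base R sketch of the intended call follows; the sample vector x is made up, and the pattern also removes the whitespace in front of each stop word so that no trailing spaces are left behind.

```r
stopwords <- c("Office", "Country", "Regional")
x <- c("China Country Office", "Bangladesh Country Office", "China")

# Build "\\s*\\b(Office|Country|Regional)\\b": close paste() before the other arguments
pattern <- paste0("\\s*\\b(", paste(stopwords, collapse = "|"), ")\\b")

# gsub(pattern, replacement, x): three separate arguments
cleaned <- trimws(gsub(pattern, "", x, perl = TRUE))
cleaned
```

With a data frame, the same call would be assigned back to the column, for example MY_df$Imp_Office in the question.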
