I have two data files. One contains only a single column with the name of the company (usually a hospital), and the other contains a list of companies with their respective addresses. The problem is that the company names do not match exactly. How can I match them approximately?
> dput(head(HOSPITALS[130:140,], 10))
I would like to obtain one data file where each company is matched with an address, if one is available in the address data.
Check out the fuzzyjoin package and the stringdist_join functions.
Here's a starting point. In your example data, ignore_case = TRUE solves the matching problem. Depending on how the full data looks, you will have to experiment with the arguments (e.g. max_dist) and possibly filter the result until you achieve what you want.
library(dplyr)
library(fuzzyjoin)
HOSPITALS %>%
  stringdist_left_join(GH_MY,
                       by = c("hospital" = "hospital_name"),
                       ignore_case = TRUE,
                       max_dist = 2,
                       distance_col = "dist")
Result:
# A tibble: 10 x 6
hospital hospital_name adress district town dist
<chr> <chr> <chr> <chr> <chr> <dbl>
1 HOSPITAL PAPAR Hospital Papar Peti Surat No. 6, Papar Sabah 0
2 HOSPITAL PARIT BUNT~ Hospital Parit ~ Jalan Sempadan Parit Bun~ Perak 0
3 HOSPITAL PEKAN Hospital Pekan 26600 Pekan Pekan Pahang 0
4 HOSPITAL PENAWAR SD~ NA NA NA NA NA
5 HOSPITAL PORT DICKS~ Hospital Port D~ KM 11, Jalan Pantai Port Dick~ Negeri ~ 0
6 HOSPITAL PULAU PINA~ Hospital Pulau ~ Jalan Residensi Pulau Pin~ Pulau P~ 0
7 HOSPITAL PUSRAWI SD~ NA NA NA NA NA
8 HOSPITAL PUSRAWI SM~ NA NA NA NA NA
9 HOSPITAL PUTRAJAYA Hospital Putraj~ Pusat Pentadbiran Ker~ Putrajaya WP Putr~ 0
10 HOSPITAL QUEEN ELIZ~ NA NA NA NA NA
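If you end up raising max_dist and get several candidate matches per hospital, the "filter the result" step mentioned above could look roughly like this (a sketch, assuming dplyr 1.0+ and that the join result is stored in a hypothetical object called matched): keep only the closest match per hospital.
matched <- HOSPITALS %>%
  stringdist_left_join(GH_MY,
                       by = c("hospital" = "hospital_name"),
                       ignore_case = TRUE,
                       max_dist = 2,
                       distance_col = "dist")

# keep the single closest match per hospital; rows with no match keep their NA
matched %>%
  group_by(hospital) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()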
I have a list of data frames, e.g. from the following code:
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE)
I would now like to combine the data frames into one, e.g. with dplyr::bind_rows(), but get Error: Can't combine ..1$Deaths <integer> and ..5$Deaths <character>. (The answer suggested here doesn't do the trick.)
So I need to convert the data types before row binding. I would like to do this inside a pipe (a tidyverse solution would be ideal) and, due to the structure of the rest of the project, not loop through the data frames, but instead use something vectorized like lapply(., function(x) {lapply(x %>% mutate_all, as.character)}) (which doesn't work) to convert all values to character.
Can someone help me with this?
You can change all the columns to character and bind the data frames together with map_df.
library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE) %>%
map_df(~.x %>% mutate(across(.fns = as.character)))
# Deaths Date Attraction `Amusement park` Location Incident Injuries
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 28 14 Feb… Transvaal Park (entire … Transvaal Park Yasenevo, Mosc… NA NA
#2 15 27 Jun… Formosa Fun Coast music… Formosa Fun Coast Bali, New Taip… NA NA
#3 8 11 May… Haunted Castle; a fire … Six Flags Great … Jackson Townsh… NA NA
#4 7 9 June… Ghost Train; a fire at … Luna Park Sydney Sydney, Austra… NA NA
#5 7 14 Aug… Skylab; a crane collide… Hamburger Dom Hamburg, (Germ… NA NA
# 6 6 13 Aug… Virginia Reel; a fire a… Palisades Amusem… Cliffside Park… NA NA
# 7 6 29 Jun… Eco-Adventure Valley Sp… OCT East Yantian Distri… NA NA
# 8 5 30 May… Big Dipper; the roller … Battersea Park Battersea, Lon… NA NA
# 9 5 23 Jun… Kuzuluk Aquapark swimmi… Kuzuluk Aquapark Akyazi, Turkey… NA NA
#10 4 24 Jul… Big Dipper; a bolt came… Krug Park Omaha, Nebrask… NA NA
# … with 1,895 more rows
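If some columns are needed as numbers again afterwards, one possible follow-up (a sketch; tables is just a hypothetical name for the bound result above) is to let readr re-guess the column types:
# re-guess column types after binding; columns that parse cleanly are
# converted back, the rest stay character
tables %>%
  readr::type_convert()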
I'm trying to create a new variable called cpi2000 that takes the CPI value of year 2000 for all observations in the series (I have four series, hence the group_by) so that I can calculate an inflation adjustment factor. However, the following code only fills in the value for the year 2000 and leaves the other years as NA. Basically, I want there to be four numbers repeating in cpi2000, one for each series.
Here's the head of my data:
Groups: series_id [1]
year series_id value seasonal_adj series_name cpi2000
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 2000 CPIAUCSL 172. seasonally adjusted US city average, all items, seasonally adjusted 172.
2 2001 CPIAUCSL 177. seasonally adjusted US city average, all items, seasonally adjusted NA
3 2002 CPIAUCSL 180. seasonally adjusted US city average, all items, seasonally adjusted NA
4 2003 CPIAUCSL 184 seasonally adjusted US city average, all items, seasonally adjusted NA
5 2004 CPIAUCSL 189. seasonally adjusted US city average, all items, seasonally adjusted NA
6 2005 CPIAUCSL 195. seasonally adjusted US city average, all items, seasonally adjusted NA
cpi_values_tidy_clean <- cpi_values_tidy %>%
  separate(date,
           into = c("year"),
           sep = "-",
           extra = "drop") %>%   # keep only the year from the date
  group_by(series_id) %>%
  mutate(cpi2000 = if_else(year == 2000, value, value[2000])) %>%
  glimpse()
Here's the output:
[1] 172.192 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 172.200 NA NA NA NA NA NA NA NA NA NA NA NA NA
[36] NA NA NA NA NA NA NA 165.717 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 165.725 NA NA NA NA NA NA
[71] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I figured the best way to do this was with an if_else statement (case_when didn't seem to work). This would work if I could figure out how to get the second argument of the if_else statement (value[2000]) to take value when year == 2000 as well, but I can't figure out how to specify a condition in that second argument.
The end goal is to create two variables, cpi2000 and cpi2019, so I can create a third variable cpi_adj = (cpi2019/cpi2000) that can be used as an inflation adjustment factor.
Any help would be greatly appreciated.
I realized I can specify the year in the second argument with value[year == 2000] instead of subsetting by position like I was with value[2000]. Subsetting with 2000 produced the NAs because there isn't a 2000th row; I could have used value[1] since I want the first value, but filtering on year is safer because it lets me specify the exact year I want. Here's the code and output I settled on:
cpi_values_tidy_clean <- cpi_values_tidy %>%
  separate(date,
           into = c("year"),
           sep = "-",
           extra = "drop") %>%   # keep only the year from the date
  group_by(series_id) %>%
  mutate(cpi2000 = if_else(year == 2000, value, value[year == 2000])) %>%
  mutate(cpi2019 = if_else(year == 2019, value, value[year == 2019])) %>%
  glimpse()
head(cpi_values_tidy_clean)
year series_id value seasonal_adj series_name cpi2000 cpi2019
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 2000 CPIAUCSL 172. seasonally adjusted US city average, all items, seasonally adjusted 172. 256.
2 2001 CPIAUCSL 177. seasonally adjusted US city average, all items, seasonally adjusted 172. 256.
3 2002 CPIAUCSL 180. seasonally adjusted US city average, all items, seasonally adjusted 172. 256.
4 2003 CPIAUCSL 184 seasonally adjusted US city average, all items, seasonally adjusted 172. 256.
5 2004 CPIAUCSL 189. seasonally adjusted US city average, all items, seasonally adjusted 172. 256.
6 2005 CPIAUCSL 195. seasonally adjusted US city average, all items, seasonally adjusted 172. 256.
If anyone knows how to do this more elegantly or with case_when, I'd love to see it.
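One possible simplification (a sketch along the same lines, not tested on the full data): inside the grouped mutate(), value[year == "2000"] is a single number that gets recycled across the whole group, so the if_else() wrapper isn't needed and the adjustment factor can be computed in the same step:
cpi_values_tidy_clean <- cpi_values_tidy %>%
  separate(date,
           into = c("year"),
           sep = "-",
           extra = "drop") %>%              # keep only the year from the date
  group_by(series_id) %>%
  mutate(cpi2000 = value[year == "2000"],   # recycled within each series
         cpi2019 = value[year == "2019"],
         cpi_adj = cpi2019 / cpi2000) %>%
  ungroup()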
I am currently trying to create a data frame from the following lists:
location <- list("USA","Singapore","UK")
organization <- list("Microsoft","University of London","Boeing","Apple")
person <- list()
date <- list("1989","2001","2018")
Jobs <- list("CEO","Chairman","VP of sales","General Manager","Director")
When I try to create a data frame I get the (obvious) error that the lengths of the lists are not equal. I want to find a way to either make the lists the same length or fill the missing data frame entries with NA. After some searching I have not been able to find a solution.
Here are purrr (part of the tidyverse) and base R solutions, assuming you just want to fill the remaining values in each list with NA. I take the maximum length of any list as len, then for each list append rep(NA) for the difference between that list's length and len.
library(tidyverse)
location <- list("USA","Singapore","UK")
organization <- list("Microsoft","University of London","Boeing","Apple")
person <- list()
date <- list("1989","2001","2018")
Jobs <- list("CEO","Chairman","VP of sales","General Manager","Director")
all_lists <- list(location, organization, person, date, Jobs)
len <- max(lengths(all_lists))
With purrr::map_dfc, you can map over the list of lists, tack on NAs as needed, convert to character vector, then get a data frame of all those vectors cbinded in one piped call:
map_dfc(all_lists, function(l) {
  c(l, rep(NA, len - length(l))) %>%
    as.character()
})
#> # A tibble: 5 x 5
#> V1 V2 V3 V4 V5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 USA Microsoft NA 1989 CEO
#> 2 Singapore University of London NA 2001 Chairman
#> 3 UK Boeing NA 2018 VP of sales
#> 4 NA Apple NA NA General Manager
#> 5 NA NA NA NA Director
In base R, you can lapply the same function across the list of lists, then use Reduce to cbind the resulting lists and convert it to a data frame. Takes two steps instead of purrr's one:
cols <- lapply(all_lists, function(l) c(l, rep(NA, len - length(l))))
as.data.frame(Reduce(cbind, cols, init = NULL))
#> V1 V2 V3 V4 V5
#> 1 USA Microsoft NA 1989 CEO
#> 2 Singapore University of London NA 2001 Chairman
#> 3 UK Boeing NA 2018 VP of sales
#> 4 NA Apple NA NA General Manager
#> 5 NA NA NA NA Director
For both of these, you can now set the names however you like.
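For example, one way (using purrr::set_names; the names here are just taken from the original lists):
map_dfc(all_lists, function(l) {
  c(l, rep(NA, len - length(l))) %>%
    as.character()
}) %>%
  set_names(c("location", "organization", "person", "date", "Jobs"))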
You could do:
data.frame(sapply(dyem_list, "length<-", max(lengths(dyem_list))))
location organization person date Jobs
1 USA Microsoft NULL 1989 CEO
2 Singapore University of London NULL 2001 Chairman
3 UK Boeing NULL 2018 VP of sales
4 NULL Apple NULL NULL General Manager
5 NULL NULL NULL NULL Director
Where dyem_list is the following:
dyem_list <- list(
  location = list("USA","Singapore","UK"),
  organization = list("Microsoft","University of London","Boeing","Apple"),
  person = list(),
  date = list("1989","2001","2018"),
  Jobs = list("CEO","Chairman","VP of sales","General Manager","Director")
)
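If you'd rather see NA than the NULL entries above (as the question asks), one option (a rough sketch, untested) is to flatten each list element to a character vector first, since length<- pads atomic vectors with NA:
# flatten each list to a character vector, then pad with NA to equal length
dyem_chr <- lapply(dyem_list, function(x) as.character(unlist(x)))
data.frame(lapply(dyem_chr, "length<-", max(lengths(dyem_chr))),
           stringsAsFactors = FALSE)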
I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge two datasets using two variables and keep all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind, and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that happens, as when I do full_join in dplyr or all = TRUE in merge, is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S., and retain the data from both n and Population. From there I can divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
You can tell merge which columns to join on with the by argument:
merge(x, y, by = c("Primary_State", "Primary_County"))
I figured it out. There was trailing whitespace in the Census data's county names, so they weren't matching the other dataset's county names. (Note to self: always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
                          by = c("Primary_State", "Primary_County"), copy = TRUE)
Those three lines did the trick. Thanks everyone!
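For what it's worth, base R's trimws() (available since R 3.2.0) does the same right-trimming, and the per-capita figure described in the question is then one mutate() away (a sketch using the object names above; the column name is just a placeholder and the division follows the question's wording):
# base-R alternative to the custom trim function
countyPops$Primary_County <- trimws(countyPops$Primary_County, which = "right")

countyPerCap <- full_join(countyLicense, countyPops,
                          by = c("Primary_State", "Primary_County")) %>%
  mutate(per_capita = Population / n)   # hypothetical column name, per the question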
I have a data frame with a text column that I would like to split into multiple columns, since the text string contains multiple variables, such as location, education, distance, etc.
Dataframe:
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
df = data.frame(text.string)
df
text.string
1 &location=NY&distance=30&education=University
2 &location=CA&distance=30&education=Highschool&education=University
3 &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business
I can split this using cSplit: cSplit(df, 'text.string', sep = "&"):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1: NA location=NY distance=30 education=University NA NA
2: NA location=CA distance=30 education=Highschool education=University NA
3: NA location=MN distance=10 industry=Healthcare NA NA
4: NA location=VT distance=30 education=University industry=IT industry=Business
The problem is that the text string may contain multiples of the same variable, or may be missing a certain variable. With cSplit, the grouping of the variables per column becomes all mixed up. I would like to avoid this and keep them grouped together.
So it would look similar to this (education and industry no longer appear in multiple columns):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1 NA location=NY distance=30 education=University <NA> NA
2 NA location=CA distance=30 education=Highschool education=University <NA> NA
3 NA location=MN distance=10 <NA> industry=Healthcare NA
4 NA location=VT distance=30 education=University industry=IT industry=Business NA
Taking @NicE's comment into account, this is one way, following your example:
library(data.table)
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
clean <- strsplit(text.string, "&|=")

out <- lapply(clean, function(x) {
  ma <- data.table(matrix(x[!x == ""], nrow = 2, byrow = FALSE))
  setnames(ma, as.character(ma[1, ]))
  ma[-1, ]
})
out <- rbindlist(out, fill = TRUE)
out
location distance education education industry industry
1: NY 30 University NA NA NA
2: CA 30 Highschool University NA NA
3: MN 10 NA NA Healthcare NA
4: VT 30 University NA IT Business
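A possible tidyverse alternative (a sketch, untested; it assumes tidyr >= 1.0 and dplyr >= 1.0, and note that it collapses repeated keys such as education into one cell per row rather than spreading them over separate columns):
library(dplyr)
library(tidyr)

df %>%
  mutate(id = row_number(),
         text.string = as.character(text.string)) %>%
  separate_rows(text.string, sep = "&") %>%                 # one key=value pair per row
  filter(text.string != "") %>%                             # drop the empty piece before the first &
  separate(text.string, into = c("key", "value"), sep = "=") %>%
  group_by(id, key) %>%
  summarise(value = paste(value, collapse = "; "), .groups = "drop") %>%
  pivot_wider(names_from = key, values_from = value)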