Grouping two data frames using stringdist_join - r

I am currently working on a project and have reached a problem... I am trying to match two data frames based on a candidate's name. I have managed to do this, however with anything more than a max_dist of 2 I start to get duplicate entries. However, these would be easily avoided if I could 'group' the candidates by race (state and district) before running stringdist_join as there are only a few candidates in each race with very little chance of having two candidates with similar names.
The goal is to obtain a table called tmpJoin where I can have both the candidateID and the canVotes, along with the name, state, district.
Any suggestions would be greatly appreciated!
Below is my code as well as a replication of the two datasets
state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <-c('01','02','02','03','01','01','02','01','02','02')
FullName <-c('Sonny Callahan','Tom Bevill','Faye Baggiano','Thomas Bevill',
'Don Sledge','William Turner', 'Bill Turner','Ed Smith','Tom Bevill',
'Edward Smith')
canVotes <-c('234','589','9234','729','149','245','879','385','8712','7099')
yearHouseResult <- data.frame(state, district, FullName,canVotes)
state <- c('AL','AL','AL','AL','AL','NY','NY','NY','NY','NY')
district <-c('01','02','02','03','01','01','02','01','02','02')
FullName <-c('Sonny Callahan','Tom Beville','Faye Baggiano','Thom Bevill','Donald Sledge','Bill Turner', 'Bill Turner','Ed Smith','Tom Bevill','Ed Smith')
candidateID <- c('1','2','3','4','5','6','7','8','9','10')
congrCands <- data.frame(state, district, FullName, candidateID)
library(fuzzyjoin)
tmpJoin <- stringdist_join(congrCands, yearHouseResult,
by = "FullName",
max_dist=2,
method = "osa",
ignore_case = FALSE,
distance_col = "matchingDistance")

You can test all three conditions with fuzzy_inner_join, also from the fuzzyjoin package.
First I had to convert the factors to numerics and characters, because differing factor levels interfere with the function.
Some notes on fuzzy_inner_join: the match_fun argument describes the three conditions, and by specifies the columns they apply to:
stringdist < 4 for FullName
district must be equal
state must be equal (district is a numeric, state is a character, therefore two different functions are needed to compare these columns)
The resulting table includes more columns than you need, so you may want to select only the ones you need. I just thought it would be easier to check the matches this way.
library(dplyr)
library(fuzzyjoin)
library(stringdist)
library(stringr)
yearHouseResult <- data.frame(state, district, FullName,canVotes) %>%
mutate(state = as.character(state),
district = as.numeric(district),
FullName = as.character(FullName))
congrCands <- data.frame(state, district, FullName, candidateID) %>%
mutate(state = as.character(state),
district = as.numeric(district),
FullName = as.character(FullName))
t <- fuzzy_inner_join(congrCands, yearHouseResult,
match_fun = list(function(x,y) stringdist(x,y,
method="osa") < 4,
`==`,
function(x,y) str_detect(x,y)),
by = c( "FullName", "district", "state"))
If you raise the stringdist threshold from 4 to 5, you will correctly match Ed/Edward Smith but incorrectly match William/Bill Turner. So you need to decide what's more important: clean matches or more matches.
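For comparison, here is a minimal sketch of the "group by race first" idea in base R, which is not the fuzzyjoin approach above: do an exact merge on state and district, then compute the edit distance only for pairs within the same race. It assumes cut-down toy versions of the two data frames (results and cands are made-up names) and uses adist(), which computes a generalized Levenshtein distance rather than the "osa" method from stringdist, so distances may differ when transpositions are involved. The threshold of 4 is only an illustration.

```r
# Toy data modelled on the question (a few rows only, for illustration)
results <- data.frame(
  state    = c("AL", "AL", "NY"),
  district = c("01", "02", "02"),
  FullName = c("Sonny Callahan", "Tom Bevill", "Edward Smith"),
  canVotes = c("234", "589", "7099"),
  stringsAsFactors = FALSE
)
cands <- data.frame(
  state       = c("AL", "AL", "NY"),
  district    = c("01", "02", "02"),
  FullName    = c("Sonny Callahan", "Tom Beville", "Ed Smith"),
  candidateID = c("1", "2", "10"),
  stringsAsFactors = FALSE
)

# 1. Exact join on the race: every candidate/result pair within a race
pairs <- merge(cands, results, by = c("state", "district"),
               suffixes = c(".cand", ".result"))

# 2. Edit distance for each pair (adist is Levenshtein, NOT osa)
pairs$matchingDistance <- mapply(
  function(a, b) drop(adist(a, b)),
  pairs$FullName.cand, pairs$FullName.result,
  USE.NAMES = FALSE
)

# 3. Keep only pairs whose names are close enough (assumed threshold)
tmpJoin <- pairs[pairs$matchingDistance <= 4, ]
print(tmpJoin)
```

With several candidates per race, you would probably also want to keep only the closest result for each candidateID.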

Related

Can't remove empty `character(0)` or `list()` values from R data frame

I have an R data frame that has character(0) and list() values inside the cells. I want to replace these with NA values.
In the following example, the field "teaser" has this issue, but it can be anywhere in the data frame.
df <- structure(list(body = "BAKER TO VEGAS 2022The Office fielded two squads this year in the 36th Annual Baker to Vegas (“B2V”) Challenge Cup Relay on April 9-10. Members of our 2022 B2V Team include many staff and AUSAs who were joined by office alums and a cadre of friends and family who helped out during some rather brutal conditions this year with temperatures around 100 degrees for much of the days. Most importantly, everyone had fun… and nobody got hurt! It was a great opportunity to meet (and run past) various members of our law enforcement community and to see the amazing logistics of the yearly event. Congratulations to all the participants.",
changed = structure(19156, class = "Date"), created = structure(19156, class = "Date"),
date = structure(19090, class = "Date"), teaser = "character(0)",
title = "Baker to Vegas 2022", url = "https://www.justice.gov/usao-cdca/blog/baker-vegas-2022",
uuid = "cd7e1023-c3ed-4234-b8af-56d342493810", vuuid = "8971702d-6f96-4bbd-ba8c-418f9d32a486",
name = "USAO - California, Central,"), row.names = 33L, class = "data.frame")
I've tried numerous things that don't work, including the following:
df <- na_if(df, "character(0)")
Error in charToDate(x) :
character string is not in a standard unambiguous format
Thanks for your help.
We could use
library(dplyr)
df %>%
mutate(across(where(is.character), ~ na_if(.x, "character(0)")))
Here is a base R way.
create a logical index taking the value TRUE when the columns are of class "character";
create an index list on those columns with lapply;
with Map, change the bad values to NA.
i_chr <- sapply(df, is.character)
inx_list <- lapply(df[i_chr], \(x) x == "character(0)")
df[i_chr] <- Map(\(x, i) {is.na(x) <- i; x}, df[i_chr], inx_list)
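Since the question also mentions list() values, here is a small sketch extending the same base R idea to several placeholder strings at once. It assumes the bad cells hold the literal strings "character(0)" and "list()" (as teaser does in the dput above); df_toy and bad are made-up names for illustration.

```r
# Toy data frame with a Date column that must be left untouched
df_toy <- data.frame(
  date   = as.Date("2022-04-09"),
  teaser = c("character(0)", "some text", "list()"),
  title  = c("a", "list()", "c"),
  stringsAsFactors = FALSE
)

# Placeholder strings to treat as missing (assumed to be literal text)
bad <- c("character(0)", "list()")

# Only touch character columns, so Date columns are never compared
is_chr <- vapply(df_toy, is.character, logical(1))
df_toy[is_chr] <- lapply(df_toy[is_chr], function(x) {
  x[x %in% bad] <- NA
  x
})
print(df_toy)
```

Restricting the replacement to character columns is what avoids the charToDate error: the Date columns are never compared against a string.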

How to select the exact matches for a list of variables to append datasets

I have different datasets for different waves. Each wave has its own prefix for both dataset and variable names. I'm trying to import and append all the data files using a subset of variables I need. Therefore, I'm currently doing:
var_list <- c("pidp", "jbsat", "jbhrs", "jbnssec8_dv", "panssec8_dv", "manssec8_dv", "paedqf", "maedqf", "qfhigh", "age_dv",
"sex_dv", "psu", "strata", "employ", "jbhas", "jboff", "jbsem", "jbstat", "jbterm1", "jbterm2", "pjbptft", "fimnet_dv",
"fimngrs_dv", "fimnlabnet_dv", "seearnnet_dv", "fimnmisc_dv", "fimnprben_dv", "fimninvent_dv", "fimnpen_dv", "fimnsben_dv",
"hhtype_dv", "livesp_dv", "nch14resp", "nmpsp_dv", "tenure_dv", "urban_dv", "jbsat", "health", "sf1", "scghqa",
"scghqb", "scghqc", "scghqd", "scgqhe", "scgqhf", "scghqg", "scghqi", "scghqj", "scghqh", "scghql", "sclsat1",
"sclsat2", "sclsat3", "sclsat4", "indscus_lw", "indscub_xw")
Then import the data for the first wave, selecting these variables and remove the wave-prefix:
longfile <- read_dta(file=paste0(dir, "ukhls_w1/a_indresp.dta")) %>%
select(matches(var_list)) %>%
rename_at(vars(starts_with("a_")), ~str_replace(.,"a_", "")) %>% #remove the wave prefix
mutate(wave = 1)
At this point, I would simply use the following loop:
for (wn in 2:10) {
wl <- paste0(letters[wn],"_")
wave_data <- read_dta(paste0(dir, "ukhls_w", wn, "/", wl, "indresp.dta")) %>%
select(matches(var_list)) %>%
rename_at(vars(starts_with(wl)), ~str_replace(.,wl, "")) %>% # remove prefix wave
mutate(wave = wn)
longfile <- rbind(longfile, wave_data)
}
However, the problem is that some variable names match more than one column in the files for the subsequent waves. For example, the second wave contains a variable named "nxtjbhrs", which gets included because it matches "jbhrs". This causes an error in rbind, as the number of columns differs.
How can I select the exact matches in this case? Or force the appending of the dataset?
Thanks for your support!
select(setdiff(names(.), var_list))
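A minimal sketch of one way to get exact matches: build the full prefixed names and keep only the columns whose names are identical to one of them, instead of treating var_list as regular expressions (which is what matches() does). wave_data here is a made-up stand-in for one wave file, and the sketch assumes that some identifiers (such as pidp) may carry no wave prefix. In a dplyr pipeline, select(any_of(wanted)) should do the same job.

```r
var_list <- c("pidp", "jbsat", "jbhrs")
wl <- "b_"  # prefix of the second wave

# Toy stand-in for one wave file; b_nxtjbhrs is the unwanted partial match
wave_data <- data.frame(
  pidp       = 1:2,
  b_jbsat    = c(5, 6),
  b_jbhrs    = c(37, 40),
  b_nxtjbhrs = c(20, 25)
)

# Names we accept: exact, with or without the wave prefix
wanted <- c(var_list, paste0(wl, var_list))
keep <- intersect(names(wave_data), wanted)

wave_sel <- wave_data[keep]
# Strip the prefix only at the start of the name
names(wave_sel) <- sub(paste0("^", wl), "", names(wave_sel))
wave_sel$wave <- 2
print(names(wave_sel))
```

Because every wave then yields the same set of column names (as long as the variables exist in that wave), rbind no longer sees a differing number of columns.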

Using tidyr::separate_rows on multiple connected rows

I have some data that I scrubbed from an offline source using some text recognition software. It looks something like the data below, but less Elvish.
elvish_ring_holders_unclean <- tibble(
name=c("Gandalf", "Galadriel", "Elrond", "Cirdan\n\nGil-Galad"),
city = c("Undying Lands","Lothlorien","Rivendell", "Mithlond\n\nLindon"),
race = c("Maiar", "Elf", "Elf", "Elf\n\nElf"))
The problem for both datasets is that certain rows have been concatenated together, separated by line breaks ("\n\n"). What I would prefer is something like the data below, with each observation having its own row:
elvish_ring_holders <- tibble(
name=c("Gandalf", "Galadriel", "Elrond", "Cirdan","Gil-Galad"),
city = c("Undying Lands","Lothlorien","Rivendell", "Mithlond", "Lindon"),
race = c("Maiar", "Elf", "Elf", "Elf", "Elf"))
So far, I have tried a tidyr::separate_rows approach
elvish_ring_holders_unclean %>%
separate_rows(name, sep = "\n\n") %>%
separate_rows(city, sep = "\n\n") %>%
separate_rows(race, sep = "\n\n") %>%
distinct()
But, I end up with a dataset where Gil-Galad and Cirdan both have two observations with two different cities with one true city and one false city.
In my real data, the race variable can also be duplicated in this way, and there are more observations. What I am looking for is a method that separates the rows once across multiple columns.
Instead of separating each column on its own, do them all in one go.
elvish_ring_holders_unclean %>%
separate_rows(everything(), sep = "\n\n")
  name      city          race
1 Gandalf   Undying Lands Maiar
2 Galadriel Lothlorien    Elf
3 Elrond    Rivendell     Elf
4 Cirdan    Mithlond      Elf
5 Gil-Galad Lindon        Elf
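To see why splitting all columns together works, here is a rough base R equivalent (not what separate_rows does internally, just the same idea): split every column on the separator and unlist, so the pieces stay aligned row by row. It assumes every column in a given row splits into the same number of pieces; unclean is a cut-down toy version of the data.

```r
# Cut-down toy version of the unclean data
unclean <- data.frame(
  name = c("Elrond", "Cirdan\n\nGil-Galad"),
  city = c("Rivendell", "Mithlond\n\nLindon"),
  race = c("Elf", "Elf\n\nElf"),
  stringsAsFactors = FALSE
)

# Split each column on the blank-line separator ...
pieces <- lapply(unclean, strsplit, split = "\n\n", fixed = TRUE)

# ... and flatten each column, so the i-th piece of every column
# ends up in the same output row
clean <- as.data.frame(lapply(pieces, unlist), stringsAsFactors = FALSE)
print(clean)
```

Separating one column at a time instead duplicates the still-unsplit values of the other columns, which is where the false name/city combinations come from.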

How to run a function (many times) that changes variable (tibble) in global env

I'm a newbie in R, so please have some patience and... tips are most welcome.
My goal is to create a tibble that holds a "Full Name" (of a person, who may have 2 to 4 names) and his/her gender. I must start from a tibble that contains typical male and female names.
Below I present a minimum working example.
My problem: I can call get_name() multiple times (in a for loop with 10,000 iterations!) and get the right answer. But I was looking for a more 'elegant' way of doing it. replicate() unfortunately returns a vector, which makes it unusable.
My doubts: I know I have some (very few... right!) issues, like the if statement, which is evaluated every time (which is redundant), but I can't find another way to do it. Any suggestions?
Any other suggestions about code structure are also welcome.
Thank you very much in advance for your help.
# Dummy name list
unit_names <- tribble(
~Women, ~Man,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
set.seed(12345) # seed for test
# Create a tibble with the full names
full_name <- tibble("Full Name" = character(), "Gender" = character() )
get_name <- function() {
# Get the Number of 'Unit-names' to compose a 'Full-name'
nbr_names <- sample(2:4, 1, replace = TRUE)
# Randomize the Gender
gender <- sample(c("Women", "Man"), 1, replace = TRUE)
if (gender == "Women") {
lim_names <- sum( !is.na(unit_names$"Women"))
} else {
lim_names <- sum( !is.na(unit_names$"Man"))
}
# Sample the Fem/Man List names (may have duplicate)
sample(unlist(unit_names[1:lim_names, gender]), nbr_names, replace = TRUE) %>%
# Form a Full-name
paste ( . , collapse = " ") %>%
# Add it to the tibble (INCLUDE the Gender)
add_row(full_name, "Full Name" = . , "Gender" = gender)
}
# How can I make 10k of this?
full_name <- get_name()
If you pass a larger number than 1 to sample this problem becomes easier to vectorise.
One thing that currently makes your problem much harder is the layout of your unit_names table: you are effectively treating male and female names as individually paired, but they clearly aren’t: hence they shouldn’t be in columns of the same table. Use a list of two vectors, for instance:
unit_names = list(
Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
Men = c("male1", "male2", "male3", "male4", "male5")
)
Then you can generate random names to your heart’s delight:
generate_names = function (n, unit_names) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
names = Map(sample, unit_names[genders], name_length, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
A note on style: unlike your function, the above doesn’t use any global variables. Furthermore, don’t "quote" variable names (you do this in unit_names$"Women" and for the arguments of add_row). R allows this, but this is arguably a mistake in the language specification: these are not strings, they’re variable names, and making them look like strings is misleading. You don’t quote your other variable names, after all. You do need to backtick-quote the `Full name` column name, since it contains a space. However, the use of backticks, rather than quotes, signifies that this is a variable name.
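For anyone without the tidyverse loaded, the same vectorised idea can be sketched in base R. This is only an illustration: it uses mapply in place of Map plus the pipe, returns a plain data frame rather than a tibble, and n = 5 is an arbitrary small size for checking the output by eye.

```r
set.seed(12345)

name_list <- list(
  Women = paste0("fem", 1:7),
  Men   = paste0("male", 1:5)
)

n <- 5
# Draw all genders and all name lengths in one call each
genders     <- sample(names(name_list), n, replace = TRUE)
name_length <- sample(2:4, n, replace = TRUE)

# For each person, sample from the matching pool and paste into one string
full <- mapply(
  function(pool, k) paste(sample(pool, k, replace = TRUE), collapse = " "),
  name_list[genders], name_length,
  USE.NAMES = FALSE
)

full_name <- data.frame(`Full name` = full, Gender = genders,
                        check.names = FALSE)
print(full_name)
```

Nothing here reads or modifies a global table inside a function; the name pools are passed in as ordinary values.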
I am not 100% sure what you are trying to get, but if I got it right... did you try mutate from dplyr? For example:
result= mutate(data.frame,
concated_column = paste(column1, column2, column3, column4, sep = '_'))
With a LITTLE help from Konrad Rudolph, here is the elegant (and vectorized ... and fast) solution that I was looking for. map2 does the necessary trick.
Here is the full working example if someone needs it:
(Just a side note: I kept the initial conversion from tibble to list because the data arrives to me as a tibble...)
Once again thanks to Konrad.
# Dummy name list
unit_names <- tribble(
~Women, ~Men,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
name_list <- list(
Women = unit_names$Women[!is.na(unit_names$Women)],
Men = unit_names$Men[!is.na(unit_names$Men)]
)
generate_names = function (n, name_list) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
#names = lapply(name_list[genders], sample, name_length) %>%
names = map2(name_list[genders], name_length, sample) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
full_name <- generate_names(10000, name_list)

How to join Spatial data with Dataframe so it can be displayed with Tmap?

Short version: when executing the following command qtm(World, "amount") I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "SHAPE_AREAS", value =
c(653989.801201595, : replacement has 177 rows, data has 175
Disclaimer: this is the same problem I used to have in this question, but if I'm not wrong, in that one the problem was that I had one variable on the left dataframe that matched to several variables on the right one, and hence, I needed to group variables on right dataframe. In this case, I am pretty sure that I do not have the same problem, as can be seen from the code below:
library(tmap)
library(dplyr) # filter(), left_join(), mutate() and %>% come from dplyr
library(tidyr)
# Read tmap's world map.
data("World")
# Load my dataframe.
df = read.csv("https://gist.githubusercontent.com/ccamara/ad106eda807f710a6f331084ea091513/raw/dc9b51bfc73f09610f199a5a3267621874606aec/tmap.sample.dataframe.csv",
na = "")
# Compare the countries in df that do not match with World's
# SpatialPolygons.
df$iso_a3 %in% World$iso_a3
# Return rows which do not match
selected.countries = df$iso_a3[!df$iso_a3 %in% World$iso_a3]
df.f = filter(df, !(iso_a3 %in% selected.countries))
# Verification.
df.f$iso_a3[!df.f$iso_a3 %in% World$iso_a3]
World@data = World@data %>%
left_join(df.f, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
qtm(World, "amount")
My guess is that the clue may be the fact that the column I am using when joining both dataframes has different levels (hence it is converted to string), but I'm ashamed to admit that I still don't understand the error that I am having here. I'm assuming I have something wrong with my dataframe, although I have to admit that it didn't work even with a smaller dataframe:
selected.countries2 = c("USA", "FRA", "ITA", "ESP")
df.f2 = filter(df, iso_a3 %in% selected.countries2)
df.f2$iso_a3 = droplevels(df.f2$iso_a3)
World@data = World@data %>%
left_join(df.f2, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
World$iso_a3 = droplevels(World$iso_a3)
qtm(World, "amount")
Can anyone help me by pointing out what's causing this error? (Providing a solution would also be much appreciated.)
Edited: It is again your data
table(df$iso_a3)
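Presumably the point of table() is to reveal iso_a3 codes that occur more than once in df: each duplicate adds a row in a left join, which would explain a replacement with more rows than the map has. A toy sketch of that effect, with made-up data (shapes stands in for the map's attribute table, and summing the amounts is just one assumed way to collapse duplicates):

```r
# Stand-in for the map's attribute table: one row per country
shapes <- data.frame(iso_a3 = c("ESP", "FRA", "ITA"),
                     area   = c(1, 2, 3),
                     stringsAsFactors = FALSE)

# Toy data with FRA appearing twice
df_toy <- data.frame(iso_a3 = c("ESP", "FRA", "FRA"),
                     amount = c(10, 20, 30),
                     stringsAsFactors = FALSE)

# Which codes are duplicated?
counts <- table(df_toy$iso_a3)
dups <- names(counts[counts > 1])
print(dups)

# A left join on a duplicated key returns more rows than the left table
joined <- merge(shapes, df_toy, by = "iso_a3", all.x = TRUE)
print(nrow(joined))

# Collapsing to one row per country first keeps the row count intact
df_agg <- aggregate(amount ~ iso_a3, data = df_toy, FUN = sum)
joined_ok <- merge(shapes, df_agg, by = "iso_a3", all.x = TRUE)
print(nrow(joined_ok))
```

If the row count after the join differs from the number of polygons, the attribute table can no longer be assigned back to the spatial object.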
