How can I convert FIPS Code to GISJOIN for census tracts - r

I have two datasets from two different agencies that report census tracts in two different ways, namely FIPSCode and GISJOIN.
I frequently have to convert between the two, so I am looking for an efficient way to do it: my dataset includes some 70,000 census tracts, and doing it manually is out of the question.
For example,
Fipscode 1073011803 = GISJOIN G0100730011803.
The logic is simple where
1 = 01 (state code)
073 = 0073 (county code)
011803 = 0011803 (census tract number)
It seems that padding with 0 for each of the three elements in a fipscode gives the GISJOIN, however, I am unsure how to convert it.
I am using excel but can work with R if there is a way.
Thank you for your time!

After giving it a few tries, I have found a solution to this.
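We need to know a bit more about GISJOIN. It is a geographic identifier for US census geographies, with a standard structure defined by NHGIS. For census tracts it is a 14-character string: the letter "G" followed by 13 digits, namely the 2-digit state code, a "0", the 3-digit county code, another "0", and the 6-digit tract code.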
For demonstration purposes, I selected five random census tracts from the HUD data with the fips2010 and converted them into the prescribed GISJOIN style.
data <- data.frame(State = c("Alabama", "Alabama", "Delaware",
"Texas", "Wisconsin"),
County = c("Jefferson County", "Montgomary County", "Kent County",
"Travis County", "Milwaukee County"),
Tract = c("118.03", "1.00", "433.00", "13.07", "86.00"),
fips2010 = c("1073011803", "1101000100", "10001043300",
"48453001307", "55079008600"))
print(data)
##Output
State County Tract fips2010
1 Alabama Jefferson County 118.03 1073011803
2 Alabama Montgomary County 1.00 1101000100
3 Delaware Kent County 433.00 10001043300
4 Texas Travis County 13.07 48453001307
5 Wisconsin Milwaukee County 86.00 55079008600
Following the logic established in the NHGIS documentation, the code below converts the fips2010 column, in place, to the GISJOIN format. The 10-character branch handles codes whose state code has lost its leading zero.
for (i in 1:nrow(data)) {
fips2010 <- data$fips2010[i]
if (nchar(fips2010) == 10) {
data$fips2010[i] <- paste0("G0", substr(fips2010, 1, 1), "0", substr(fips2010, 2, 4), "0", substr(fips2010, 5, 10))
} else if (nchar(fips2010) == 11) {
data$fips2010[i] <- paste0("G", substr(fips2010, 1, 2), "0", substr(fips2010, 3, 5), "0", substr(fips2010, 6, 11))
}
}
print(data)
##Output
State County Tract fips2010
1 Alabama Jefferson County 118.03 G0100730011803
2 Alabama Montgomary County 1.00 G0101010000100
3 Delaware Kent County 433.00 G1000010043300
4 Texas Travis County 13.07 G4804530001307
5 Wisconsin Milwaukee County 86.00 G5500790008600
A side by side comparison: fips2010 and GISJOIN
State County Tract fips2010 GISJOIN
1 Alabama Jefferson County 118.03 1073011803 G0100730011803
2 Alabama Montgomary County 1.00 1101000100 G0101010000100
3 Delaware Kent County 433.00 10001043300 G1000010043300
4 Texas Travis County 13.07 48453001307 G4804530001307
5 Wisconsin Milwaukee County 86.00 55079008600 G5500790008600
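The same rule can also be written without a loop. Below is a minimal vectorized sketch; the function names fips_to_gisjoin and gisjoin_to_fips are hypothetical helpers (not part of NHGIS or any package), and it assumes the tract codes are stored as character strings of 10 or 11 digits.

```r
# Hypothetical helper: tract FIPS code (10 or 11 characters) -> GISJOIN.
# Assumes `fips` is a character vector; a 10-character code is treated as
# an 11-digit code whose leading zero was dropped.
fips_to_gisjoin <- function(fips) {
  fips <- as.character(fips)
  fips <- ifelse(nchar(fips) == 10, paste0("0", fips), fips)
  paste0("G",
         substr(fips, 1, 2), "0",   # state code + "0"
         substr(fips, 3, 5), "0",   # county code + "0"
         substr(fips, 6, 11))       # tract code
}

# Hypothetical reverse helper: GISJOIN -> 11-digit tract FIPS code.
gisjoin_to_fips <- function(gisjoin) {
  paste0(substr(gisjoin, 2, 3),   # state
         substr(gisjoin, 5, 7),   # county
         substr(gisjoin, 9, 14))  # tract
}

fips <- c("1073011803", "48453001307")
gisjoin <- fips_to_gisjoin(fips)
back <- gisjoin_to_fips(gisjoin)
```

Note that the reverse helper returns the zero-padded 11-digit code, so "1073011803" comes back as "01073011803".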
I hope this helps anyone dealing with a similar issue.

Related

Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R

Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)
Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):
Index  State_Name  Abbreviation
1      California  CA
2      Maryland    MD
3      New York    NY
4      Texas       TX
5      Virginia    VA
6      Washington  WA
And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :
Index  State  Article
1      NA     Texas governor, Abbott, signs new abortion bill
2      NA     Effort to recall California governor Newsome loses steam
3      NA     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      NA     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      NA     Amazon HQ2 causing housing prices to soar in northern Virginia
Task: To create an R function that loops over df2$Article, finds the state mentioned in each row, and cross-references it with df1$State_Name to replace the NA in df2$State with the corresponding df1$Abbreviation. I know it's quite a mouthful. I'm stuck on how to start and how to finish this puzzle. Hard-coding is not an option, as the real datasheet I have has thousands of rows like this and will be updated as we add more articles to text-scrape.
The output should look like:
Index  State  Article
1      TX     Texas governor, Abbott, signs new abortion bill
2      CA     Effort to recall California governor Newsome loses steam
3      NY     New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4      MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5      NA     DC statehood unlikely as Manchin opposes
6      VA     Amazon HQ2 causing housing prices to soar in northern Virginia
Note: The fifth entry with DC is intended to be NA.
Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!
You can create a regex pattern from State_Name and use str_extract to extract the state from Article. Then use match to get the corresponding Abbreviation from df1.
library(stringr)
df2$State <- df1$Abbreviation[match(str_extract(df2$Article,
str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use inbuilt state.name and state.abb instead of df1 to get state name and abbreviations.
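A minimal base R sketch of that idea, using the built-in state.name and state.abb vectors in place of df1. The articles vector is illustrative only (the third headline is made up to show a two-word state name), and the word-boundary anchors are an assumption added here, not part of the answer above.

```r
# Hypothetical headlines; the third one is made up to show a two-word state.
articles <- c("Texas governor, Abbott, signs new abortion bill",
              "DC statehood unlikely as Manchin opposes",
              "West Virginia senator opposes the bill")

# One alternation of all 50 built-in state names, bounded by word breaks.
pattern <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")

# regexpr() finds the first match per article (-1 if there is none).
m <- regexpr(pattern, articles, perl = TRUE)
found <- rep(NA_character_, length(articles))
found[m > 0] <- regmatches(articles, m)

# Look up the abbreviation; articles without a state name stay NA.
abbr <- state.abb[match(found, state.name)]
```

The \\b anchors are a precaution so that a state name only matches as a whole word.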
Here's a way to do this in for loop -
for(i in seq(nrow(df1))) {
inds <- grep(df1$State_Name[i], df2$Article)
if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia
Not as concise as above but a Base R approach:
# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
# Coerce 0 length vectors to na values of the appropriate type:
# .zero_to_nas => function()
.zero_to_nas <- function(x){
if(identical(x, character(0))){
NA_character_
}else if(identical(x, integer(0))){
NA_integer_
}else if(identical(x, numeric(0))){
NA_real_
}else if(identical(x, complex(0))){
NA_complex_
}else if(identical(x, logical(0))){
NA
}else{
x
}
}
# Unlist cleaned list: res => vector
res <- unlist(lapply(lst, .zero_to_nas))
# Explicitly define the returned object: res => vector
return(res)
}
# Classify each article as belonging to the appropriate state:
# clean_df => data.frame
clean_df <- transform(
df2,
State = df1$Abbreviation[
match(
list_2_vec(
regmatches(
Article,
gregexpr(
paste0(df1$State_Name, collapse = "|"), Article
)
)
),
df1$State_Name
)
]
)
# Data:
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland",
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA",
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA),
Article = c("Texas governor, Abbott, signs new abortion bill",
"Effort to recall California governor Newsome loses steam",
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))

Lapply instead of for loop in r

I want to write a function, rankall, that takes two arguments: an outcome name (outcome) and a hospital ranking (num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the ranking specified in num.
rankall <- function(outcome, num = "best") {
## Read outcome data
## Check that state and outcome are valid
## For each state, find the hospital of the given rank
## Return a data frame with the hospital names and the
## (abbreviated) state name
}
head(rankall("heart attack", 20), 10)
hospital state
AK <NA> AK
AL D W MCMILLAN MEMORIAL HOSPITAL AL
AR ARKANSAS METHODIST MEDICAL CENTER AR
AZ JOHN C LINCOLN DEER VALLEY HOSPITAL AZ
CA SHERMAN OAKS HOSPITAL CA
CO SKY RIDGE MEDICAL CENTER CO
CT MIDSTATE MEDICAL CENTER CT
DC <NA> DC
DE <NA> DE
FL SOUTH FLORIDA BAPTIST HOSPITAL FL
My function works correctly, but for the last step (formatting the 2-column data frame) I used the following loop:
new_data <- vector()
for(i in sort(unique(d$State))){
new_data <- rbind(new_data,cbind(d$Hospital.Name[which(d$State == i)][num],i))
}
new_data <- as.data.frame(new_data)
It is correct, but I know that the same loop can be written with lapply.
My attempt is wrong:
lapply(d,function(x) x <-rbind(x,d$Hospital.Name[which(d$State == i)][num]))
How can I fix that?
I'm supposing your d data is already sorted:
new_data <- do.call(rbind,
lapply(unique(d$State),
function(state){
data.frame(State = state,
Hospital.Name = d$Hospital.Name[which(d$State==state)][num],
stringsAsFactors = FALSE)
}))
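A variant that avoids rbind altogether is to split the hospital names by state and pick the num-th entry from each group. This is only a sketch on made-up data: the toy data frame d and its values are hypothetical, and, like the answer above, it assumes d is already sorted by rank within each state and that num is numeric.

```r
# Hypothetical toy data, already ordered by rank within each state.
d <- data.frame(State = c("AL", "AL", "AK", "AK", "AK"),
                Hospital.Name = c("A1", "A2", "K1", "K2", "K3"),
                stringsAsFactors = FALSE)
num <- 2

# split() returns one character vector of hospital names per state,
# with the states in sorted order.
by_state <- split(d$Hospital.Name, d$State)

# vapply() takes the num-th hospital of each state; indexing past the
# end of a group yields NA, as in the expected output of the question.
new_data <- data.frame(
  hospital = vapply(by_state, function(h) h[num], character(1)),
  state = names(by_state),
  stringsAsFactors = FALSE
)
```

Because split sorts the groups, the result comes out in state order without an explicit sort(unique(...)).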

R: Using plyr to perform fuzzy string matching between matching subsets of two data sources

Say I have a list of counties with varying amounts of spelling errors or other issues that differentiate them from the 2010 FIPS dataset (code to create fips dataframe below), but the states in which the misspelled counties reside are entered correctly. Here's a sample of 21 random observations from my full dataset:
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
county state
1 Beauregard LA
2 De Soto LA
3 Dekalb GA
4 Webster LA
5 Saint Joseph IN
6 West Feliciana LA
7 Ketchikan Gateway AK
8 Evangeline LA
9 Richmond City VA
10 Saint Mary LA
11 Saint Louis City MO
12 Mclean KY
13 Union LA
14 Bienville LA
15 Covington City VA
16 Martinsville City VA
17 Claiborne LA
18 King And Queen VA
19 Mclean ND
20 Mcminn TN
21 Prince Georges MD
I've used adist to create a fuzzy string matching algorithm that matches around 80% of my counties to the county names in fips. However, sometimes it will match two counties with similar spelling, but from different states (e.g., "Webster, LA" gets matched to "Webster, GA" rather than "Webster Parish, LA").
distance <- adist(tomatch$county,
fips$countyname,
partial = TRUE)
min.name <- apply(distance, 1, min)
matchedcounties <- NULL
for(i in 1:nrow(distance)) {
s2.i <- match(min.name[i], distance[i, ])
s1.i <- i
matchedcounties <- rbind(data.frame(s2.i = s2.i,
s1.i = s1.i,
s1name = tomatch[s1.i, ]$county,
s2name = fips[s2.i, ]$countyname,
adist = min.name[i]),
matchedcounties)
}
Therefore, I want to restrict fuzzy string matching of county to the correctly spelled versions with matching state.
My current algorithm makes one big matrix which calculates standard Levenshtein distances between both sources and then selects the value with the minimum distance.
To solve my problem, I'm guessing I'd need to create a function that could be applied to each 'state' group by ddply, but I'm confused as to how I should indicate that the group value in the ddply function should match another dataframe. A dplyr solution or solution using any other package would be appreciated as well.
Code to create FIPS dataset:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
'./nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
Here's a way with dplyr. I first join the tomatch data.frame with the FIPS names by state (allowing only in-state matches):
require(dplyr)
df <- tomatch %>%
left_join(fips, by="state")
Next, I noticed that a lot of counties don't have 'Saint' but 'St.' in the FIPS dataset. Cleaning that up first should improve the results obtained.
df <- df %>%
mutate(county_clean = gsub("Saint", "St.", county))
Then, group this data.frame by county, and calculate the distance with adist:
df <- df %>%
group_by(county_clean) %>% # Calculate the distance per county
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
arrange(county, dist) # Used this for visual inspection.
Note that I took the diagonal from the resulting matrix as adist returns an n x m matrix with n representing the x vector and m representing the y vector (it calculates all of the combinations).
Optionally, you could add the agrep result:
df <- df %>%
rowwise() %>% # 'group_by' a single row.
mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() # Always a good idea to remove 'groups' after you're done.
Then filter as you did before, take the minimum distance:
df <- df %>%
group_by(county_clean) %>% # Causes it to calculate the 'min' per group
filter(dist == min(dist)) %>%
ungroup()
Note that this could result in more than one row returned for each of the input rows in tomatch.
Alternatively, do it all in one run (I usually change code to this format once I'm confident it's doing what it's supposed to do):
df <- tomatch %>%
# Join on all names in the relevant state and clean 'St.'
left_join(fips, by="state") %>%
mutate(county_clean = gsub("Saint", "St.", county)) %>%
# Calculate the distances, per original county name.
group_by(county_clean) %>%
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
# Append the agrepl result
rowwise() %>%
mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() %>%
# Only retain minimum distances
group_by(county_clean) %>%
filter(dist == min(dist))
The result in both cases:
county county_clean state countyname dist string_agrep
1 Beauregard Beauregard LA Beauregard Parish 0 TRUE
2 De Soto De Soto LA De Soto Parish 0 TRUE
3 Dekalb Dekalb GA DeKalb 1 TRUE
4 Webster Webster LA Webster Parish 0 TRUE
5 Saint Joseph St. Joseph IN St. Joseph 0 TRUE
6 West Feliciana West Feliciana LA West Feliciana Parish 0 TRUE
7 Ketchikan Gateway Ketchikan Gateway AK Ketchikan Gateway Borough 0 TRUE
8 Evangeline Evangeline LA Evangeline Parish 0 TRUE
9 Richmond City Richmond City VA Richmond city 1 TRUE
10 Saint Mary St. Mary LA St. Mary Parish 0 TRUE
11 Saint Louis City St. Louis City MO St. Louis city 1 TRUE
12 Mclean Mclean KY McLean 1 TRUE
13 Union Union LA Union Parish 0 TRUE
14 Bienville Bienville LA Bienville Parish 0 TRUE
15 Covington City Covington City VA Covington city 1 TRUE
16 Martinsville City Martinsville City VA Martinsville city 1 TRUE
17 Claiborne Claiborne LA Claiborne Parish 0 TRUE
18 King And Queen King And Queen VA King and Queen 1 TRUE
19 Mclean Mclean ND McLean 1 TRUE
20 Mcminn Mcminn TN McMinn 1 TRUE
21 Prince Georges Prince Georges MD Prince George's 1 TRUE
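For comparison, the same in-state restriction can be sketched in base R without a join. This is an illustration on made-up data, not the method of the answer above: the helper match_in_state and the small tomatch and fips data frames below are hypothetical stand-ins for the real ones.

```r
# Hypothetical miniature versions of the two data sets.
tomatch <- data.frame(county = c("Webster", "Dekalb", "Saint Mary"),
                      state = c("LA", "GA", "LA"),
                      stringsAsFactors = FALSE)
fips <- data.frame(state = c("LA", "GA", "GA", "LA"),
                   countyname = c("Webster Parish", "Webster",
                                  "DeKalb", "St. Mary Parish"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: best fuzzy match among the counties of one state.
match_in_state <- function(county, state, ref) {
  candidates <- ref$countyname[ref$state == state]
  if (length(candidates) == 0) return(NA_character_)
  # Partial Levenshtein distance from the input to each in-state candidate.
  d <- adist(county, candidates, partial = TRUE)
  candidates[which.min(d)]
}

# Apply the same "Saint" -> "St." clean-up before matching.
tomatch$matched <- mapply(match_in_state,
                          gsub("Saint", "St.", tomatch$county),
                          tomatch$state,
                          MoreArgs = list(ref = fips),
                          USE.NAMES = FALSE)
```

Here "Webster" in LA can only be compared against Louisiana names, so it can no longer be matched to the Georgia county of the same name. Unlike the filter on the minimum distance, which.min keeps a single candidate even when there are ties.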
I don't have example data, but try using agrep instead of adist, searching only the names in the same state:
sapply(seq_len(nrow(df_tomatch)), function(i) agrep(df_tomatch$county[i], df_matchby$county[df_matchby$state == df_tomatch$state[i]], value = TRUE))
You can use the max.distance argument in agrep to vary how close they need to match. Also, setting value=TRUE returns the value of the matched string rather than the location of the match.

R getting error of number of rows when using replace function

I'm trying to get rid of some specific words in a data frame column. So the data set looks like somewhat like this with 3235 rows:
V1 V2
AUTAUGA COUNTY 1
BALDWIN COUNTY 3
VALDEZ-CORDOVA CENSUS AREA 261
what I'm trying to do is:
data$V1 <- replace(data$V1, " COUNTY", "")
But I get an error that looks like this:
Error in `$<-.data.frame`(`*tmp*`, "V1", value = c("AUTAUGA COUNTY", :
replacement has 3236 rows, data has 3235
Am I using the function the wrong way? Or is there any other way to do this?
Thanks!
Hugo,
For the example you've provided, this code works well:
eg <- data.frame(V1 = c("AUTUAGA COUNTY", "BALDWIN COUNTY",
"VALDEZ-CORDOVA CENSUS AREA"),
V2 = c(1, 3, 261))
eg$gsub <- gsub(" COUNTY", "", eg$V1)
eg
                          V1  V2                       gsub
1             AUTUAGA COUNTY   1                    AUTUAGA
2             BALDWIN COUNTY   3                    BALDWIN
3 VALDEZ-CORDOVA CENSUS AREA 261 VALDEZ-CORDOVA CENSUS AREA
Does this resolve the error?
(Edited to fix the output column names.)
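As for why the original call fails: replace(x, list, values) is essentially x[list] <- values, so its second argument is an index, not a pattern. A character index that matches no existing name appends a new element, which would explain the one extra row in the error message. A small sketch with a two-element vector (made up to mirror the question):

```r
x <- c("AUTAUGA COUNTY", "BALDWIN COUNTY")

# " COUNTY" is treated as an element name, not as text to search for,
# so a third element named " COUNTY" is appended.
y <- replace(x, " COUNTY", "")
n_after <- length(y)

# sub() does pattern replacement; the $ anchors the match at the end.
cleaned <- sub(" COUNTY$", "", x)
```

The anchored pattern is an optional refinement; gsub(" COUNTY", "", x) as in the answer above gives the same result for this data.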

Isolating partial text in r data frame

I have an R data frame that contains U.S. state and county names in one column. The data is in the format:
United States - State name - County name
where each cell is a unique county. For example:
United States - North Carolina - Wake County
United States - North Carolina - Warren County
etc.
I need to break the column into 2 columns, one containing just the state name and the other containing just the county name. I've experimented with sub and gsub but am getting no results. I understand this is probably a simple matter for R experts, but I'm a newbie. I would be most grateful if anyone can point me in the right direction.
You can use tidyr's separate function:
library(tidyr)
df <- separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
If the data is as you show in your question (including United States as country) and if your data frame is called df and the current column with the data is called currentColumn.
Example:
df <- data.frame(currentColumn = c("United States - North Carolina - Wake County",
"United States - North Carolina - Warren County"), val = rnorm(2))
df
# currentColumn val
#1 United States - North Carolina - Wake County 0.8173619
#2 United States - North Carolina - Warren County 0.4941976
separate(df, currentColumn, into = c("Country", "State", "County"), sep = " - ")
# Country State County val
#1 United States North Carolina Wake County 0.8173619
#2 United States North Carolina Warren County 0.4941976
Using read.table, and assuming your data is in df$var
read.table(text=df$var,sep="-",strip.white=TRUE,
col.names=c("Country","State","County"))
If speed is an issue, then strsplit will be a lot quicker:
setNames(data.frame(do.call(rbind,strsplit(df$var,split=" - "))),
c("Country","State","County"))
Both give:
# Country State County
#1 United States North Carolina Wake County
#2 United States North Carolina Warren County

Resources