Finding MSA for [Cityname, Statename] in R

I'm looking for an R package that helps me find the respective Metropolitan Statistical Area (MSA) for input data in the form [Cityname, State Abbreviation], for instance "New York, NY" or "San Francisco, CA". I do not have the county name, ZIP code, FIPS code, or anything else.
What I found:
MSA-to-county relationships (2015) are provided by the U.S. Census Bureau as "Core based statistical areas (CBSAs), metropolitan divisions, and combined statistical areas (CSAs)". A list of "Principal cities of metropolitan and micropolitan statistical areas" (2015) is available on the same page. In the worst case, I imagine, one could take the second list, append the state code to the "Principal City Name" field, and then match the resulting string against an MSA.
Before trying that, it would be good to know whether this problem has already been solved. I did not find the desired function in the noncensus or UScensus2010 packages.
My question, therefore: do you know a package that matches a (principal) city with its MSA?
Thanks!
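For reference, here is a minimal sketch of that worst-case approach. The file name and the column names (Principal.City.Name, State.Name, CBSA.Title) are assumptions; the actual headers depend on the vintage of the Census file you download.
```{r}
# A minimal sketch of "attach the state code, then match the string".
# File and column names are assumptions about the Census download.
principal <- read.csv("principal_cities_2015.csv", stringsAsFactors = FALSE)

# Build a "City, ST" key; state.abb and state.name ship with base R.
principal$key <- paste0(principal$Principal.City.Name, ", ",
                        state.abb[match(principal$State.Name, state.name)])

# Hypothetical helper: look up the MSA (CBSA title) for "City, ST" input.
msa_for <- function(city_state) {
  principal$CBSA.Title[match(tolower(city_state), tolower(principal$key))]
}

msa_for("New York, NY")
```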


R - How to get/print the description of the variables in a dataset loaded from a package/library (e.g. ISLR)?

Is there a function that can print the description / detailed info (what a variable represents, what its units are, etc.) about the variables that are part of a dataset loaded from a library or package?
Note: I am using jupyter notebook.
Is there a way to check if a dataset has in-built info?
I have loaded the datasets from the library (ISLR) for the book "Introduction to Statistical Learning with R."
I want to see the description of the variables included in the 'College' dataset, like Top10perc, Outstate, etc.
# load library
library(ISLR)
df <- College  # save the data under the generic name 'df'
For example (this description comes from the PDF reference manual for the ISLR package):
__College__
U.S. News and World Report’s College Data
__Description__
Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
__Format__
A data frame with 777 observations on the following 18 variables.
Private A factor with levels No and Yes indicating private or public university
Apps Number of applications received
Accept Number of applications accepted
Enroll Number of new students enrolled
Top10perc Pct. new students from top 10% of H.S. class
Top25perc Pct. new students from top 25% of H.S. class
F.Undergrad Number of fulltime undergraduates
P.Undergrad Number of parttime undergraduates
Outstate Out-of-state tuition
Room.Board Room and board costs
Books Estimated book costs
Personal Estimated personal spending
PhD Pct. of faculty with Ph.D.’s
Terminal Pct. of faculty with terminal degree
S.F.Ratio Student/faculty ratio
perc.alumni Pct. alumni who donate
Expend Instructional expenditure per student
Grad.Rate Graduation rate
They are typically documented in help files. For example, to get the documentation on the Auto data set:
library(ISLR)
?Auto
This shows the help page for that data set. To list all of the help files in the package, so you can click through for more information:
help(package = "ISLR")
Alternatively, the help files are assembled into a Reference Manual that can be readily accessed from the package's CRAN page, e.g. https://cran.r-project.org/package=ISLR
This should help you
```{r}
library(ISLR)
?ISLR
```
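Since the question mentions Jupyter, where the help pager can render differently, str() is a handy complement: it prints the variable names, types, and sample values straight into the output cell (it won't show the prose descriptions, which live only in the help file). A small sketch:
```{r}
library(ISLR)
?College          # full help page for the College dataset
str(College)      # variable names, types, and example values, inline
```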

How to clean abbreviations containing a period ("e.g.", "st.", "rd.") but leave the "." at the end of a sentence?

I am working on a sentence-level LDA in R and am currently trying to split my text data into individual sentences with the sent_detect() function from the openNLP package.
However, my text data contains a lot of abbreviations that have a "period symbol" but do not mark the end of a sentence. Here are some examples: "st. patricks day", "oxford st.", "blue rd.", "e.g."
Is there a way to create a gsub() call to account for such two-character abbreviations and remove their "." so that it is not wrongly detected by the sent_detect() function? Unfortunately, these abbreviations are not always between two words; sometimes they do indeed mark the end of a sentence:
Example:
"I really liked Oxford st." - the "st." marks the end of a sentence and the "." should remain.
vs
"Oxford st. was very busy." - the "st." does not stand at the end of a sentence, thus, the "."-symbol should be replaced.
I am not sure whether there is a solution for this, but maybe someone more familiar with sentence-level analysis knows how to deal with such issues.
Thank you!
Looking at your previously asked questions, I would suggest looking into the textclean package. A lot of what you want is already included in that package, and any missing functions can be reused or expanded upon.
Just replacing "st." with something is going to lead to problems, as it could mean street or saint, but "st. patricks day" is easy to find. The problem you will have is making a list of possible occurrences and finding alternatives for them. The easiest tools to use are translation tables. Below I create a table for a few abbreviations and their expected long forms; it is up to you (or your client) to specify what you want as the end result. The best way is to create the table in Excel or a database and load it into a data.frame (and store it somewhere for easy access). Depending on your text this might be a lot of work, but it will improve the quality of your outcome.
Example:
library(textclean)
text <- c("I really liked Oxford st.", "Oxford st. was very busy.",
          "e.g. st. Patricks day was on oxford st. and blue rd.")
# Create an abbreviations table. Note the escaped "rd\\." so we match the
# abbreviation "rd." and not every word ending in "rd". Also decide whether
# it should become "road" or could mean something else.
abbreviations <- data.frame(abbreviation = c("st. patricks day", "oxford st.", "rd\\.", "e.g."),
                            replacement = c("saint patricks day", "oxford street", "road", "eg"),
                            stringsAsFactors = FALSE)
# I use the replace_contraction function, since it lets you replace the
# default contraction table with your own.
text <- replace_contraction(text, abbreviations)
text
[1] "I really liked oxford street" "oxford street was very busy."
[3] "eg saint patricks day was on oxford street and blue road"
# The results above are missing their end marks; add them back:
text <- add_missing_endmark(text, ".")
text
[1] "I really liked oxford street." "oxford street was very busy."
[3] "eg saint patricks day was on oxford street and blue road."
textclean has a range of replace_*() functions, most of which are based on the mgsub() function in the package. Check the package documentation for an overview of what they all do.
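If you do want the plain gsub() route from the question, here is a minimal sketch of one heuristic (not a general solution): assume the sentence continues whenever the word after the abbreviation starts in lowercase. It only covers "st." and "rd." and will misfire on sentences that begin with a lowercase word.
```{r}
text <- c("I really liked Oxford st.", "Oxford st. was very busy.")

# Drop the period after "st."/"rd." only when followed by a space and a
# lowercase letter, i.e. when the sentence appears to continue.
gsub("\\b([Ss]t|[Rr]d)\\.(?=\\s[a-z])", "\\1", text, perl = TRUE)
#> [1] "I really liked Oxford st." "Oxford st was very busy."
```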

Matching postcode to index of deprivation

I have a list of UK postcodes in my dataset, and I would like to convert them to their deprivation index. This website does it: http://imd-by-postcode.opendatacommunities.org/imd/2019 - but I need it done in R, rather than manually entering thousands of postcodes individually.
Does anyone have any experience/idea of a package that does this?
Many thanks
The Office for National Statistics has some lookup tables that match postcodes to various scales of output areas, for example: Postcode to Output Area
Hopefully you can find a common field to merge by.
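Once you have a lookup table, the join itself is small. A minimal sketch, assuming an ONS postcode-to-LSOA lookup and the IMD 2019 file keyed by LSOA; all file and column names here (pcds, lsoa11cd, imd_decile) are assumptions that depend on the exact files you download:
```{r}
# Assumed file and column names; adjust to match the actual downloads.
lookup <- read.csv("postcode_to_lsoa.csv", stringsAsFactors = FALSE)  # pcds, lsoa11cd
imd    <- read.csv("imd2019_by_lsoa.csv", stringsAsFactors = FALSE)   # lsoa11cd, imd_decile

# my_data is your dataset, with a "postcode" column.
df <- merge(my_data, lookup, by.x = "postcode", by.y = "pcds")
df <- merge(df, imd, by = "lsoa11cd")
```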

How to extract the acronym specific for my data with UMLS

I am new to the Unified Medical Language System (UMLS). I would like to annotate text from reports about endoscopy examinations, so the terminology is specific to gastroenterology. Some of the text contains acronyms like TI, which here means terminal ileum; according to UMLS, however, TI also stands for a number of non-gastroenterological terms. I would like to build a gastroenterology-only lexicon from UMLS terms. Is there a way to do this?

Trying to use regex (or R) to turn press releases into a dataset

I'm working on a project to turn press releases from Operation Inherent Resolve, which detail airstrikes against ISIS in Syria and Iraq, into a usable dataset. So far we've been hand-coding everything, but it just takes an insanely long time.
Every press release is structured like this:
November 23, 2016
Military Strikes Continue Against ISIL Terrorists in Syria and Iraq
U.S. Central Command
SOUTHWEST ASIA, November 23, 2016 - On Nov. 22, Coalition military forces conducted 17 strikes against ISIL terrorists in Syria and Iraq. In Syria, Coalition military forces conducted 11 strikes using attack, bomber, fighter, and remotely piloted aircraft against ISIL targets. Additionally in Iraq, Coalition military forces conducted six strikes coordinated with and in support of the Government of Iraq using attack, bomber, fighter, and remotely piloted aircraft against ISIL targets.
The following is a summary of the strikes conducted since the last press release:
Syria
Near Abu Kamal, one strike destroyed an oil rig.
Near Ar Raqqah, four strikes engaged an ISIL tactical unit, destroyed two vehicles, an oil tanker truck, an oil pump, and a VBIED, and damaged a road.
Iraq
Near Rawah, one strike engaged an ISIL tactical unit and destroyed a vehicle, a mortar system, and a weapons cache.
Near Mosul, four strikes engaged three ISIL tactical units, destroyed six ISIL-held buildings, a mortar system, a vehicle, a weapons cache, a supply cache, and an artillery system, and damaged five supply routes, and a bridge.
(more text I don't need, about five exceptions where they amend previous reports, which I'll just fix by hand, and then the next report)
What I'm trying to do is pull out just the date of the strike and how many strikes per city for both Iraq and Syria and reformat that information into a proper dataset organized as one row per date, like this:
           Rawah  Mosul  (and so on)
1/1/2014       1      4
1/2/2014       2      5
The bad: There's a different number of cities listed for each country in each press release, and a different number of strikes listed each time.
The good: Every one of these press releases is worded exactly the same:
The string "SOUTHWEST ASIA," is always in front of the date
A 4-space indent followed by the word "Near" is always in front of the city
The city and a comma are always in front of the number of strikes
The number of strikes is always in front of the word "airstrike" or "airstrikes"
The question is whether it's possible to write a regex that either copies/cuts everything matching those criteria, in order, or just deletes everything else. I think that to grab the arbitrary number of cities (with unknown names) and the unknown numbers of strikes, it would have to be based on capturing everything next to the unchanging markers.
I've tried using Notepad++'s find/replace function with something like *(foobar)*, but I can only match one thing at a time, and when I try to replace everything but the matched string it just deletes the whole file instead of preserving every instance of the matching string.
I suggest searching with Near (.*?),. You can back-reference the city with \1.
I did a quick scan of the documents, and it seems the more recent ones change the format a bit, adding "Strikes at [country]" rather than just "[country]" as in your example. But each one lists the city in a Near [city], format.
This would grab you the cities, of course, but you would have to do some pretty hacky things to get the number of strikes, since there doesn't seem to be a standard for that.
If you are only dealing with the records that have your formatting, try Near (.*?), (.*? ) and you should get the spelled out number of strikes per city by referencing \2, and the city by referencing \1.
So, if you were to find and replace in Notepad++, you would use something like .*Near (.*?), (.*? ).* as your find and something like \1 -- \2 as your replace. From there you would need to draft a small script to translate the spelled-out numbers to digits and output them where they need to go. The pattern \w* \d{1,2}, \d{4} will match a date in the long format, something else you could pipe into a Python script or similar to construct your table of data. Sorry I couldn't help more there!
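Since the question title mentions R as an option, here is a minimal sketch of the same extraction in base R. It assumes the fixed wording described above; the sample release text and the word-to-digit table are assumptions and would need to be extended for real data.
```{r}
# Assumed fixed wording: "SOUTHWEST ASIA, <date> -" and
# "Near <city>, <number> strike(s)".
release <- "SOUTHWEST ASIA, November 23, 2016 - On Nov. 22, ...
Near Rawah, one strike engaged an ISIL tactical unit.
Near Mosul, four strikes engaged three ISIL tactical units."

# Date: the "Month DD, YYYY" text following "SOUTHWEST ASIA, "
date <- sub(".*SOUTHWEST ASIA, (\\w+ \\d{1,2}, \\d{4}).*", "\\1", release)

# City/count pairs from "Near <city>, <number> strike(s)"
pat   <- "Near ([^,]+), (\\w+) strikes?"
hits  <- regmatches(release, gregexpr(pat, release))[[1]]
parts <- regmatches(hits, regexec(pat, hits))

# Hypothetical word-to-digit table; extend as needed for your data.
words <- c(one = 1, two = 2, three = 3, four = 4, five = 5,
           six = 6, seven = 7, eight = 8, nine = 9, ten = 10)

data.frame(date    = date,
           city    = sapply(parts, `[`, 2),
           strikes = unname(words[tolower(sapply(parts, `[`, 3))]))
# -> two rows dated November 23, 2016: Rawah 1, Mosul 4
```
From there, getting one row per date with one column per city is a reshape, e.g. with stats::reshape or tidyr::pivot_wider.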
