I have state, county and MSA names in a single string variable states_county_MSA, and I want to split them to create three distinct variables - states, county and MSAs.
tail(df$states_county_MSA,n=10)
[1] "Iowa Polk Des Moines"
[2] "Mississippi Hinds Jackson"
[3] "Georgia Richmond Augusta-Richmond"
[4] "Ohio Mahoning Youngstown-Warren-Boardman"
[5] "Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[6] "Pennsylvania Dauphin Harrisburg-Carlisle"
[7] "Florida Brevard Palm Bay-Melbourne-Titusville"
[8] "Utah Utah Provo-Orem"
[9] "Tennessee Hamilton Chattanooga"
[10] "North Carolina Durham Durham"
Modifying the solution by @jared_mamrot to a similar question (splitting a state-county variable into distinct state and county variables, posted below; full problem here for reference: Extracting states and counties from state-county character variable), I can split the states_county_MSA variable into two variables: states and a combined county-MSA variable.
library(tidyverse)
states_county_names_df <- data.frame(states_county = c(
"California San Francisco",
"New York Bronx",
"Illinois Cook",
"Massachusetts Suffolk",
"District of Columbia District of Columbia"
)
)
data(state)
states_inc_Columbia <- c(state.name, "District of Columbia")
states_county_names_df %>%
mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))
However, in this scenario I am not able to decompose states_county_MSA further, as I cannot find a built-in dataset of county or MSA names. I could not get county.names to work, and I tried the tigris, censusapi and maps packages but was unable to generate a vector of US county names for the string split/extract command.
> data(county.names)
Warning in data(county.names) : data set ‘county.names’ not found
I was thinking of using the word function but names of MSAs are not standard either (one or more words).
Would anyone know a way to split the county-MSA in an efficient manner ?
EDIT - Data with (space) delimiter {state, county, MSA, MSA population, month, year}.
[1] "Virginia Richmond Richmond 1,210,063 8 2014"
[2] "Louisiana Orleans New Orleans-Metairie-Kenner 1,195,794"
[3] "North Carolina Wake Raleigh-Cary 1,137,346 6 2014"
[4] "New York Erie Buffalo-Niagara Falls 1,135,342"
[5] "Alabama Jefferson Birmingham-Hoover 1,129,034"
[6] "Utah Salt Lake Salt Lake City 1,091,432 5 2014"
[7] "New York Monroe Rochester 1,080,082"
[8] "Michigan Kent Grand Rapids-Wyoming 989,205 7 2014"
[9] "Arizona Pima Tucson 981,935 10 2013"
[10] "Hawaii Honolulu Honolulu 956,336 8 2013"
I think this should work:
library(tidyverse)
data <- tibble::tribble(~state_county_msa,
"Iowa Polk Des Moines" ,
"Mississippi Hinds Jackson" ,
"Georgia Richmond Augusta-Richmond" ,
"Ohio Mahoning Youngstown-Warren-Boardman" ,
"Pennsylvania Lackawanna Scranton--Wilkes-Barre",
"Pennsylvania Dauphin Harrisburg-Carlisle" ,
"Florida Brevard Palm Bay-Melbourne-Titusville" ,
"Utah Utah Provo-Orem" ,
"Tennessee Hamilton Chattanooga" ,
"North Carolina Durham Durham")
state_county <- ggplot2::map_data("county") %>%
select(state = region,
county = subregion) %>%
as_tibble() %>%
mutate(across(everything(),str_to_title)) %>%
unite(state_county, c("state","county"), sep = " ", remove = FALSE) %>%
distinct(state_county, .keep_all = TRUE)
state_county_string <- paste(state_county$state_county, collapse = "|")
data %>%
mutate(state_county = str_extract(state_county_msa, state_county_string),
msa = str_trim(str_remove(state_county_msa, state_county_string))) %>%
left_join(state_county, by = "state_county") %>%
select(state, county, msa)
Output:
# A tibble: 10 × 3
state county msa
<chr> <chr> <chr>
1 Iowa Polk Des Moines
2 Mississippi Hinds Jackson
3 Georgia Richmond Augusta-Richmond
4 Ohio Mahoning Youngstown-Warren-Boardman
5 Pennsylvania Lackawanna Scranton--Wilkes-Barre
6 Pennsylvania Dauphin Harrisburg-Carlisle
7 Florida Brevard Palm Bay-Melbourne-Titusville
8 Utah Utah Provo-Orem
9 Tennessee Hamilton Chattanooga
10 North Carolina Durham Durham
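For the format shown in the question's EDIT, where a population and sometimes a month and year trail the names, the numeric fields can be peeled off before the state/county/MSA split. The following is a minimal base R sketch, not part of the answer above: split_trailing_fields is a hypothetical helper, and it assumes the population always uses comma thousands separators and that month and year, when present, are the last two fields.

```r
# Hypothetical helper: separate trailing "population [month year]" fields
# from the "state county MSA" text. Assumes month/year are either both
# present or both absent, and that population is the last comma-grouped number.
split_trailing_fields <- function(x) {
  date_pat <- "\\s+([0-9]{1,2})\\s+([0-9]{4})$"
  has_date <- grepl(date_pat, x)
  month <- as.integer(ifelse(has_date, sub(paste0("^.*", date_pat), "\\1", x), NA))
  year  <- as.integer(ifelse(has_date, sub(paste0("^.*", date_pat), "\\2", x), NA))

  # Drop month/year (if any), then read the population off the end
  rest    <- sub(date_pat, "", x)
  pop_pat <- "\\s+([0-9][0-9,]*)$"
  population <- as.numeric(gsub(",", "", sub(paste0("^.*", pop_pat), "\\1", rest)))

  data.frame(
    state_county_msa = sub(pop_pat, "", rest),
    population = population,
    month = month,
    year = year,
    stringsAsFactors = FALSE
  )
}

split_trailing_fields(c(
  "Virginia Richmond Richmond 1,210,063 8 2014",
  "New York Erie Buffalo-Niagara Falls 1,135,342"
))
```

The resulting state_county_msa column has the same shape as the strings in the original question, so it can be passed to the state/county extraction above.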
Explanation
I'm trying to write a function that finds the lowest number in one of these columns and returns the hospital name.
"Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack"
"Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure"
"Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia"
I cannot understand why my results differ from the samples in the PDF. Please keep in mind that I'm new to R.
Example
For best("TX", "heart attack") the function should return "CYPRESS FAIRBANKS MEDICAL CENTER", while my function returns the following (note that the correct result isn't even in this vector):
[1] "HEREFORD REGIONAL MEDICAL CENTER"
[2] "EAST TEXAS MEDICAL CENTER MOUNT VERNON"
[3] "ALLEGIANCE SPECIALTY HOSPITAL OF KILGORE"
[4] "KNOX COUNTY HOSPITAL"
[5] "EAST TEXAS MEDICAL CENTER TRINITY"
[6] "COMMUNITY GENERAL HOSPITAL"
[7] "KELL WEST REGIONAL HOSPITAL"
[8] "GOOD SHEPHARD MEDICAL CENTER-LINDEN"
[9] "BURLESON ST JOSEPH HEALTH CENTER"
[10] "MCCAMEY HOSPITAL"
[11] "FISHER COUNTY HOSPITAL DISTRICT"
[12] "HANSFORD COUNTY HOSPITAL"
[13] "ST LUKES LAKESIDE HOSPITAL"
Code
best <- function(state, outcome) {
file <- read.csv("outcome-of-care-measures.csv")
data <- file[file$State == state, ]
if (outcome == "heart attack") {
number <- 15 #column number
} else if (outcome == "heart failure") {
number <- 21
} else if (outcome == "pneumonia") {
number <- 27
}
col <- as.numeric(data[, number]) #column data
lowest <- min(col, na.rm = TRUE) #lowest number
data$Hospital.Name[data[, number] == lowest] #result
}
Sources
Data I work with
PDF with instructions Check point 2.
I'm going to post the solution: after an hour of searching I found it! In the first steps I accidentally copied the wrong column numbers from the documentation.
Column numbers are incorrect.
Solution
Simply change wrong column numbers (15, 21, 27) to (11, 17, 23)
Thanks
Thank you for your answers, it increased my knowledge. Have a nice weekend.
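One way to avoid this kind of off-by-position mistake is to select the outcome column by name instead of by number. The sketch below is an illustration rather than the code from the post: best_by_name is a hypothetical function, it takes the already-loaded data frame as an argument, and it assumes the rate columns were read as text (with non-numeric entries for missing values) and that ties should be returned in alphabetical order.

```r
# Hypothetical variant of best() that looks up the outcome column by name.
# `data` is assumed to be the data frame read from outcome-of-care-measures.csv.
best_by_name <- function(data, state, outcome) {
  cols <- c(
    "heart attack"  = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack",
    "heart failure" = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure",
    "pneumonia"     = "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia"
  )
  if (!outcome %in% names(cols)) stop("invalid outcome")

  in_state <- data[data$State == state, ]
  if (nrow(in_state) == 0) stop("invalid state")

  # Non-numeric entries become NA (the coercion warning is expected)
  rates <- suppressWarnings(as.numeric(as.character(in_state[[cols[[outcome]]]])))

  # Compare the numeric rates, not the raw text column
  winners <- in_state$Hospital.Name[!is.na(rates) & rates == min(rates, na.rm = TRUE)]
  sort(as.character(winners))
}

# Usage, assuming the file from the question is in the working directory:
# hospitals <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
# best_by_name(hospitals, "TX", "heart attack")
```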
complete dataset link : https://drive.google.com/open?id=12u0Ql1z5T2lzCXRVjp75i9ke9mNYrCWv
In this data you can see that the General Motors entries are not counted together because they are listed as different categories. There are many more manufacturers like this. I want to group them together, as with General Motors. How can I group them using NLP in R?
Try this way to achieve your goal:
Your Input data.frame:
Vehicle_Manufacturer<-c("GENERAL MOTORS CORP.","FORD MOTOR COMPANY","CHRYSLER CORPORATION","PACCAR INCORPORATED","MACK TRUCKS, INCORPORATED","FOREST RIVER, INC.","BLUE BIRD BODY COMPANY","DAIMLER TRUCKS NORTH AMERICA","GENERAL MOTORS LLC","HONEYWELL INTERNATIONAL, INC.","WINNEBAGO INDUSTRIES, INC.","BMW OF NORTH AMERICA, LLC","NISSAN NORTH AMERICA, INC.","NAVISTAR INTL CORP.","INTERNATIONAL TRUCK AND ENGINE","FREIGHTLINER LLC","HONDA (AMERICAN HONDA MOTOR CO.)","NEWMAR CORPORATION","NAVISTAR, INC","INTERNATIONAL TRUCK & ENGINE CORPORATION","PIERCE MANUFACTURING","GULF STREAM COACH, INC.","FLEETWOOD ENTERPRISES, INC.","FREIGHTLINER CORPORATION","DAIMLER TRUCKS NORTH AMERICA LLC","PACCAR, INCORPORATED","WHITE MOTOR CORPORATION","BAYERISCHE MOTOREN WERKE","THOMAS BUILT BUSES, INC.","DAIMLERCHRYSLER CORPORATION","VOLKSWAGEN OF AMERICA,INC","SPARTAN MOTORS, INC.","VOLVO TRUCKS NORTH AMERICA INC","TOYOTA MOTOR ENGINEERING & MANUFACTURING","PREVOST CAR, INCORPORATED","CHAMPION BUS, INC.","ALTEC INDUSTRIES INC.","SABERSPORT","MERCEDES-BENZ USA, LLC.","HARLEY-DAVIDSON MOTOR COMPANY","COOPER TIRE & RUBBER CO.","KEYSTONE RV COMPANY","SUBARU OF AMERICA, INC.","CHRYSLER (FCA US LLC)","MONACO COACH CORPORATION","CHRYSLER GROUP LLC","JAYCO, INC.","MITSUBISHI FUSO TRUCK OF AMERICA, INC.","COLLINS BUS CORPORATION","PRO-A MOTORS, INC.","NAVISTAR, INC.")
Recalls<-c(6228,5403,2787,2317,1988,1903,1898,1737,1620,1558,1353,1297,1174,1130,1055,987,985,980,955,950,925,922,918,896,835,824,818,801,797,794,749,731,724,709,694,669,641,623,616,613,599,586,582,578,578,572,569,568,559,549,511)
df<-data.frame(Vehicle_Manufacturer,Recalls)
Using package stringdist find similar strings between Vehicle_Manufacturer, in this example using Jaro-Winkler distance:
library(stringdist)
dist_matrix<-stringdistmatrix(as.character(df[,1]),as.character(df[,1]),method="jw")
Find a threshold below which strings are considered similar and grouped, like this:
thr<-quantile(dist_matrix,probs=0.025) #2.5% quantile
Find the strings to merge (a for loop in this example, but if you have a lot of data an lapply solution is better):
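to_merge<-vector("list",nrow(df)) # pre-allocate a list with one slot per manufacturer
for(i in seq_len(nrow(df)))
{
  # names whose distance to row i falls below the threshold (always includes row i itself)
  to_merge[[i]]<-Vehicle_Manufacturer[dist_matrix[i,]<thr]
}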
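The lapply alternative mentioned above could look like the following sketch. find_merge_candidates is a hypothetical helper name; it takes the vector of names, the square distance matrix and the threshold as arguments so that it does not depend on global variables.

```r
# Hypothetical lapply version of the loop: for each row i, return the
# names whose distance to name i is below the threshold.
find_merge_candidates <- function(manufacturer, dist_matrix, thr) {
  lapply(seq_along(manufacturer), function(i) manufacturer[dist_matrix[i, ] < thr])
}

# Usage with the objects defined above:
# to_merge <- find_merge_candidates(Vehicle_Manufacturer, dist_matrix, thr)
```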
Your output will be in to_merge list
To see only possible merge:
to_merge[sapply(to_merge, length) > 1]
[[1]]
[1] "GENERAL MOTORS CORP." "GENERAL MOTORS LLC"
[[2]]
[1] "PACCAR INCORPORATED" "PACCAR, INCORPORATED"
[[3]]
[1] "MACK TRUCKS, INCORPORATED" "PACCAR, INCORPORATED"
[[4]]
[1] "DAIMLER TRUCKS NORTH AMERICA" "DAIMLER TRUCKS NORTH AMERICA LLC"
[[5]]
[1] "GENERAL MOTORS CORP." "GENERAL MOTORS LLC"
[[6]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
[[7]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
[[8]]
[1] "DAIMLER TRUCKS NORTH AMERICA" "DAIMLER TRUCKS NORTH AMERICA LLC"
[[9]]
[1] "PACCAR INCORPORATED" "MACK TRUCKS, INCORPORATED" "PACCAR, INCORPORATED"
[[10]]
[1] "NAVISTAR INTL CORP." "NAVISTAR, INC" "NAVISTAR, INC."
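The list above repeats each group once per member and still needs to be turned into a single label per row. One possible follow-up, sketched here with a different technique from the answer, is to cluster the names and cut the tree at a fixed height. This version uses base R only: normalised Levenshtein distance from adist instead of Jaro-Winkler, and single-linkage hclust. The function name group_similar_names and the cut height 0.3 are assumptions chosen for illustration, and any threshold can produce false matches (such as the MACK TRUCKS / PACCAR pair above), so the groups should be reviewed by hand.

```r
# Hypothetical helper: assign one integer group id per name.
# Distance = Levenshtein distance divided by the longer string's length,
# so 0 means identical and 1 means completely different.
group_similar_names <- function(x, h = 0.3) {
  d <- adist(x)                              # edit distances (utils package)
  len <- nchar(x)
  d <- d / pmax(outer(len, len, pmax), 1)    # scale to the range [0, 1]
  tree <- hclust(as.dist(d), method = "single")
  cutree(tree, h = h)                        # names linked below height h share an id
}

manufacturer <- c("GENERAL MOTORS CORP.", "GENERAL MOTORS LLC",
                  "FORD MOTOR COMPANY",
                  "PACCAR INCORPORATED", "PACCAR, INCORPORATED")
groups <- group_similar_names(manufacturer)

# Recalls can then be summed per group, e.g. with the df defined above:
# tapply(df$Recalls, group_similar_names(as.character(df$Vehicle_Manufacturer)), sum)
```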
I have a dataset consisting of zip codes along with latitude and longitude. I want to find the list of hospitals/banks within 2 km of each latitude/longitude pair.
How to do it?
The Long/Lat data looks like
store_zip lon lat
410710 73.8248981 18.5154681
410209 73.0907 19.0218215
400034 72.8148177 18.9724162
400001 72.836334 18.9385352
400102 72.834424 19.1418961
400066 72.8635299 19.2313448
400078 72.9327444 19.1570343
400078 72.9327444 19.1570343
400007 72.8133825 18.9618411
400050 72.8299518 19.0551695
400062 72.8426858 19.1593396
400083 72.9374227 19.1166191
400603 72.9781047 19.1834148
401107 72.8929 19.2762702
401105 72.8663173 19.3053477
400703 72.9992013 19.0793547
401209 NA NA
401203 72.7983705 19.4166761
400612 73.0287209 19.1799265
400612 73.0287209 19.1799265
400612 73.0287209 19.1799265
If your Points of Interest are unknown and you need to find them, you can use Google's API through my googleway package (as you've suggested in the comments). You will need a valid API key for this to work.
As the API can only accept one request at a time, you'll need to iterate over your data one row at a time. For that you can use whatever looping method you're most comfortable with:
library(googleway) ## using v2.4.0 on CRAN
set_key("your_api_key")
lst <- lapply(1:nrow(df), function(x){
google_places(search_string = "Hospital",
location = c(df[x, 'lat'], df[x, 'lon']),
radius = 2000)
})
lst is now a list that contains the results of the queries. For example, the names of the hospitals it has returned for the first row of your data are
place_name(lst[[1]])
# [1] "Jadhav Hospital"
# [2] "Poona Hospital Medical Store"
# [3] "Sanjeevan Hospital"
# [4] "Suyash Hospital"
# [5] "Mehta Hospital"
# [6] "Deenanath Mangeshkar Hospital"
# [7] "Sushrut Hospital"
# [8] "Deenanath Mangeshkar Hospital and Research Centre"
# [9] "MMF Ratna Memorial Hospital"
# [10] "Maharashtra Medical Foundation's Joshi Multispeciality Hospital"
# [11] "Sahyadri Hospitals"
# [12] "Deendayal Memorial Hospital"
# [13] "Jehangir Specialty Hospital"
# [14] "Global Hospital And Research Institute"
# [15] "Prayag Hospital"
# [16] "Apex Superspeciality Hospital"
# [17] "Deoyani Multi Speciality Hospital"
# [18] "Shashwat Hospital"
# [19] "Deccan Multispeciality Hardikar Hospital"
# [20] "City Hospital"
You can also view them on a map
set_key("map_api_key", api = "map")
## the lat/lon of the returned results are found through `place_location()`
# place_location(lst[[1]])
df_hospitals <- place_location(lst[[1]])
df_hospitals$name <- place_name(lst[[1]])
google_map() %>%
add_circles(data = df[1, ], radius = 2000) %>%
add_markers(data = df_hospitals, info_window = "name")
Note:
Google's API is limited to 2,500 queries per key per day, unless you pay for a premium account.
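If the points of interest are already known (for example from your own table of hospitals, which is an assumption and not the case covered above), the 2 km filter needs no API call: compute the great-circle distance and keep the rows below the cutoff. A minimal sketch using the haversine formula in base R follows; haversine_km is a hypothetical helper and the Earth radius of 6371 km is the usual mean value.

```r
# Hypothetical helper: great-circle distance in km between points given in
# decimal degrees. Vectorised, so lat2/lon2 can be whole columns.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# Usage, assuming a hypothetical data frame `poi` with columns lat and lon:
# near <- poi[haversine_km(df$lat[1], df$lon[1], poi$lat, poi$lon) <= 2, ]
```

Rows with missing coordinates (such as zip 401209 in the sample data) yield NA distances and should be dropped before filtering.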
I am trying to find out whether the strings in a vector called "value" contain substrings from two different lists. This is my current code:
for (i in 1:length(value)){
  for (j in 1:length(city)){
    if (str_detect(value[i], city[j])){
      for (k in 1:length(school)){
        if (str_detect(value[i], school[k])){
          ...........................................................
        }
      }
    }
  }
}
city and school are separate vectors of different length, each containing string elements.
city <- c("Madrid", "London", "Paris", "Sofia", "Cairo", "Detroit", "New York")
school <- c("Law", "Mathematics", "PoliSci", "Economics")
value <- c("Rey Juan Carlos Law Dept, Madrid", "New York University, Center of PoliSci Studies", ..........)
What I want to do is see if value contains some combination of elements from both lists to later work with that. Can this be done in a single step: something like this:
for (i in 1:length(value)){
  if (str_detect(value[i], city[j]) && str_detect(value[i], school[j])){
    .............................................
  }
}
Try this:
library("stringr")
city <- c("Madrid", "London", "Paris", "Sofia", "Cairo", "Detroit", "New York")
school <- c("Law", "Mathematics", "PoliSci", "Economics")
value <- c(
"Rey Juan Carlos Law Dept, Madrid",
"New York University, Center of PoliSci Studies",
"Los Angeles, CALTECH",
"London, Physics",
"London, Mathematics"
)
for (v in value) {
  # print v only if it mentions at least one city and at least one school
  if (any(str_detect(v, city)) && any(str_detect(v, school))) {
    print(v)
  }
}
When executed, it prints the values that contain an element of both city and school:
[1] "Rey Juan Carlos Law Dept, Madrid"
[1] "New York University, Center of PoliSci Studies"
[1] "London, Mathematics"
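The same check can also be written without an explicit loop. The sketch below uses base R only; contains_any is a hypothetical helper, and it matches the terms literally (fixed = TRUE), so names containing regex metacharacters are safe. Matching stays case-sensitive.

```r
# Hypothetical helper: TRUE for each element of x that contains at least
# one of the terms as a literal substring.
contains_any <- function(x, terms) {
  hits <- lapply(terms, function(term) grepl(term, x, fixed = TRUE))
  Reduce(`|`, hits)
}

city <- c("Madrid", "London", "Paris", "Sofia", "Cairo", "Detroit", "New York")
school <- c("Law", "Mathematics", "PoliSci", "Economics")
value <- c(
  "Rey Juan Carlos Law Dept, Madrid",
  "New York University, Center of PoliSci Studies",
  "Los Angeles, CALTECH",
  "London, Physics",
  "London, Mathematics"
)

# Logical vector, one entry per element of value
keep <- contains_any(value, city) & contains_any(value, school)
value[keep]
```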
This problem is similar to one I have been working on. For my purposes, I need a data frame returned that retains the structure of the original input.
This may be true for you too. I have thus amended the excellent solution from @rbm as follows:
library("stringr")
cityList <- c("Madrid", "London", "Paris", "Sofia", "Cairo", "Detroit", "New York")
schoolList <- c("Law", "Mathematics", "PoliSci", "Economics")
valueList <- c(
"Rey Juan Carlos Law Dept, Madrid",
"New York University, Center of PoliSci Studies",
"Los Angeles, CALTECH",
"London, Physics",
"London, Mathematics"
)
df <- data.frame(value = valueList, city = NA, school = NA, stringsAsFactors = FALSE)
for (i in seq_along(valueList)) {
  v <- valueList[i]
  city_hit <- str_detect(v, cityList)
  school_hit <- str_detect(v, schoolList)
  if (any(city_hit) && any(school_hit)) {
    # record the matching city and school (assumes exactly one match of each)
    df$city[i] <- cityList[[which(city_hit)]]
    df$school[i] <- schoolList[[which(school_hit)]]
  } else {
    df$city[i] <- ""
    df$school[i] <- ""
  }
}
print(df)
This results in the following:
                                           value     city      school
1               Rey Juan Carlos Law Dept, Madrid   Madrid         Law
2 New York University, Center of PoliSci Studies New York     PoliSci
3                           Los Angeles, CALTECH
4                                London, Physics
5                            London, Mathematics   London Mathematics
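Indexing with [[which(...)]] fails as soon as a value matches more than one entry of a list. A possible variation, sketched here in base R, fills each column independently and keeps the first match. first_match is a hypothetical helper; unlike the loop above it returns NA instead of "" when nothing matches, and it reports a city even when no school is found.

```r
# Hypothetical helper: for each element of x, return the first term found
# in it as a literal substring, or NA if none of the terms occur.
first_match <- function(x, terms) {
  vapply(x, function(v) {
    hits <- terms[vapply(terms, grepl, logical(1), x = v, fixed = TRUE)]
    if (length(hits) > 0) hits[[1]] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

cityList <- c("Madrid", "London", "Paris", "Sofia", "Cairo", "Detroit", "New York")
schoolList <- c("Law", "Mathematics", "PoliSci", "Economics")
valueList <- c(
  "Rey Juan Carlos Law Dept, Madrid",
  "New York University, Center of PoliSci Studies",
  "Los Angeles, CALTECH",
  "London, Physics",
  "London, Mathematics"
)

result <- data.frame(
  value = valueList,
  city = first_match(valueList, cityList),
  school = first_match(valueList, schoolList),
  stringsAsFactors = FALSE
)
```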