To make groups according to addresses in demographic data - r

My data set is not in English but in Korean.
The number of observations is more than 3000.
The data set's name is demo.
str(demo)
This has information of each person in each row.
$ 거주지역: Factor w/ 900 levels "","강원 강릉시 포남1동",..: 595 235 595 832 12 126 600 321 600 589 ...
Above is the 4th column's structure of the data set.
I want to make groups according to 4th column which indicates addresses of people.
The problem is that the level of the factor is 900. This happens because the addresses are fully written.
I want to make groups to assign people in some provinces. So R needs to read the factors and identify the letters to make groups.
How can I do this? Please give me a help. I googled it for so much time but I could not find it.

Here's maybe a start, not sure how it will work with non-Latin characters.
foo <- data.frame(value=rnorm(3),
address=c("blah blah province1", "blah blah province2", "province3"),
stringsAsFactors=FALSE)
words <- strsplit(foo$address, " ")
words <- do.call(rbind, words)
foo$province <- words[, 3]
head(foo)
Output:
value address province
1 0.01129269 blah blah province1 province1
2 0.99160104 blah blah province2 province2
3 1.59396745 province3 province3
Guessing by this wiki page on South Korean address formats, if the city and province (ward?) are always in the beginning of the address, then it's a bit easier and we can avoid using rbind, which in the code above recycles shorter addresses.
foo <- data.frame(value=rnorm(3),
address=c("seoul ward1 street", "seoul ward2 street", "not-seoul ward-something street"),
stringsAsFactors=FALSE)
foo$city <- sapply(foo$address, function(x) strsplit(x, split=" ")[[1]][1])
foo$ward <- sapply(foo$address, function(x) strsplit(x, split=" ")[[1]][2])
Now we can also use ifelse to use wards if in Seoul and cities otherwise.
foo$group <- with(foo, ifelse(city=="seoul", ward, city))
foo
value address city ward group
1 1.0071995 seoul ward1 street seoul ward1 ward1
2 0.7192918 seoul ward2 street seoul ward2 ward2
3 -0.6047117 not-seoul ward-something street not-seoul ward-something not-seoul

Related

how to clean irregular strings & organize them into a dataframe at right column

I have two long strings that look like this in a vector:
x <- c("Job Information\n\nLocation: \n\n\nScarsdale, New York, 10583-3050, United States \n\n\n\n\n\nJob ID: \n53827738\n\n\nPosted: \nApril 22, 2020\n\n\n\n\nMin Experience: \n3-5 Years\n\n\n\n\nRequired Travel: \n0-10%",
"Job Information\n\nLocation: \n\n\nGlenview, Illinois, 60025, United States \n\n\n\n\n\nJob ID: \n53812433\n\n\nPosted: \nApril 21, 2020\n\n\n\n\nSalary: \n$110,000.00 - $170,000.00 (Yearly Salary)")
and my goal is to neatly organized them in a dataframe (output form) something like this:
#View(df)
Location Job ID Posted Min Experience Required Travel Salary
[1] Scarsdale,... 53827738 April 22... 3-5 Years 0-10% NA
[2] Glenview,... 53812433 April 21... NA NA $110,000.00 - $170,000.00 (Yearly Salary)
(...) was done to present the dataframe here neatly.
However as you see, two strings doesn't necessarily have same attibutes. Forexample, first string has Min Experience and Required Travel, but on the second string, those field don't exist, but has Salary. So this getting very tricky for me. I thought I will read between \n character but they are not set, some have two newlines, other have 4 or 5. I was wondering if someone can help me out. I will appreciate it!
We can split the string on one or more '\n' ('\n{1,}'). Remove the first word from each (which is 'Job Information') as we don't need it anywhere (x <- x[-1]). For remaining part of the string we can see that they are in pairs in the form of columnname - columnvalue. We create a dataframe from this using alternating index and bind_rows combine all of them by name.
dplyr::bind_rows(sapply(strsplit(gsub(':', '', x), '\n{1,}'), function(x) {
x <- x[-1]
setNames(as.data.frame(t(x[c(FALSE, TRUE)])), x[c(TRUE, FALSE)])
}))
# Location Job ID Posted Min Experience
#1 Scarsdale, New York, 10583-3050, United States 53827738 April 22, 2020 3-5 Years
#2 Glenview, Illinois, 60025, United States 53812433 April 21, 2020 <NA>
# Required Travel Salary
#1 0-10% <NA>
#2 <NA> $110,000.00 - $170,000.00 (Yearly Salary)

trying to create a value of strings using the | or operator

I am trying to scrape a website link. So far I downloaded the text and set it as a dataframe. I have the folllowing;
keywords <- c(credit | model)
text_df <- as.data.frame.table(text_df)
text_df %>%
filter(str_detect(text, keywords))
where credit and model are two values I want to search the website, i.e. return row with the word credit or model in.
I get the following error
Error in filter_impl(.data, dots) : object 'credit' not found
The code only returns the results with the word "model" in and ignores the word "credit".
How can I go about returning all results with either the word "credit" or "model" in.
My plan is to have keywords <- c(credit | model | more_key_words | something_else | many values)
Thanks in advance.
EDIT:
text_df:
Var 1 text
1 Here is some credit information
2 Some text which does not expalin any keywords but messy <li> text9182edj </i>
3 This line may contain the keyword model
4 another line which contains nothing of use
So I am trying to extract just rows 1 and 3.
I think the issue is you need to pass a string as an argument to str_detect. To check for "credit" or "model" you can paste them into a single string separated by |.
library(tidyverse)
library(stringr)
text_df <- read_table("Var 1 text
1 Here is some credit information
2 Some text which does not expalin any keywords but messy <li> text9182edj </i>
3 This line may contain the keyword model
4 another line which contains nothing of use")
keywords <- c("credit", "model")
any_word <- paste(keywords, collapse = "|")
text_df %>% filter(str_detect(text, any_word))
#> # A tibble: 2 x 3
#> Var `1` text
#> <int> <chr> <chr>
#> 1 1 Here is some credit information
#> 2 3 This line may contain the keyword model
Ok I have checked it and I think it will not work you way, as you must use the or | operator inside filter() not inside str_detect()
So it would work this way:
keywords <- c("virg", "tos")
library(dplyr)
library(stringr)
iris %>%
filter(str_detect(Species, keywords[1]) | str_detect(Species, keywords[2]))
as a keywords[1] etc you have to specify each "keyword" from this variable
I would recommend staying away from regex when you're dealing with words. There are packages tailored for your particular task that you can use. Try, for example, the following
library(corpus)
text <- readLines("http://norvig.com/big.txt") # sherlock holmes
terms <- c("watson", "sherlock holmes", "elementary")
text_locate(text, terms)
## text before instance after
## 1 1 …Book of The Adventures of Sherlock Holmes
## 2 27 Title: The Adventures of Sherlock Holmes
## 3 40 … EBOOK, THE ADVENTURES OF SHERLOCK HOLMES ***
## 4 50 SHERLOCK HOLMES
## 5 77 To Sherlock Holmes she is always the woman. I…
## 6 85 …," he remarked. "I think, Watson , that you have put on seve…
## 7 89 …t a trifle more, I fancy, Watson . And in practice again, I …
## 8 145 …ere's money in this case, Watson , if there is nothing else.…
## 9 163 …friend and colleague, Dr. Watson , who is occasionally good …
## 10 315 … for you. And good-night, Watson ," he added, as the wheels …
## 11 352 …s quite too good to lose, Watson . I was just balancing whet…
## 12 422 …as I had pictured it from Sherlock Holmes ' succinct description, but…
## 13 504 "Good-night, Mister Sherlock Holmes ."
## 14 515 …t it!" he cried, grasping Sherlock Holmes by either shoulder and loo…
## 15 553 "Mr. Sherlock Holmes , I believe?" said she.
## 16 559 "What!" Sherlock Holmes staggered back, white with…
## 17 565 …tter was superscribed to " Sherlock Holmes , Esq. To be left till call…
## 18 567 "MY DEAR MR. SHERLOCK HOLMES ,--You really did it very w…
## 19 569 …est to the celebrated Mr. Sherlock Holmes . Then I, rather imprudentl…
## 20 571 …s; and I remain, dear Mr. Sherlock Holmes ,
## ⋮ (189 rows total)
Note that this matches the term regardless of the case.
For your specific use case, do
ix <- text_detect(text, terms)
or
matches <- text_subset(text, terms)

Retrieving latitude/longitude coordinates for cities/countries that have since changed names?

Say I have a vector of cities and countries, which may or may not include names of places that have since changed names:
locations <- c("Paris, France", "Sarajevo, Yugoslavia", "Rome, Italy", "Leningrad, Soviet Union", "St Petersburg, Russia")
The problem is that I can't use something like ggmap::geocode since it doesn't appear to work well for locations whose names have changed:
ggmap::geocode(locations, source = "dsk")
lon lat
1 2.34880 48.85341 #Works for Paris
2 NA NA #Didn't work for Sarajevo
3 12.48390 41.89474 #Works for Rome
4 98.00000 60.00000 #Didn't work for the old name of St Petersburg seems to just get the center of Russia
5 30.26417 59.89444 #Worked for St Petersburg
Is there an alternative functions I could use? If I have to "update" the names of the cities & countries, is there an easy method of going through this? I have hundreds of locations that I was looking to collect the longitude and latitude coordinates.
This might not be what you had in mind, but if you use the exact same code with only the city names (and not the countries), at least the two cases that you mentioned (Sarajevo and Leningrad) seem to work fine. You could try to run the function with a modified locations vector including just the city names, and see if you still get errors. Something like this:
(cities <- gsub(',.*', '', locations))
## [1] "Paris" "Sarajevo" "Rome" "Leningrad" "St Petersburg"
cbind(ggmap::geocode(cities, source = 'dsk'), cities)
## lon lat cities
## 1 2.34880 48.85341 Paris
## 2 18.35644 43.84864 Sarajevo
## 3 12.48390 41.89474 Rome
## 4 30.26417 59.89444 Leningrad
## 5 30.26417 59.89444 St Petersburg

Extract state abbreviation and zip code from strings

I want to extract state abbreviation (2 letters) and zip code (either 4 or 5 numbers) from the following string
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
For the zip code, I tried few methods that I found on here but it didn't work mainly because of the 5 number street address or zip codes that have only 4 numbers
text <- readLines(textConnection(address))
library(stringi)
zip <- stri_extract_last_regex(text, "\\d{5}")
zip
library(qdapRegex)
rm_zip3 <- rm_(pattern="(?<!\\d)\\d{5}(?!\\d)", extract = TRUE)
zip <- rm_zip3(text)
zip
[1] "99577" "1670" "35007" "0360" "06854" "0404" "83338" "4333" "4330" "8223" "5495" "5233" NA
For the state abbreviation, I have no idea how to extract
Any help is appreciated! Thanks in advance!
Edit 1: Include phone numbers
Code to extract zip code:
zip <- str_extract(text, "\\d{5}")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{5}$)")
Code to extract phone numbers:
phone <- str_extract(text, "\\b\\d{3}-\\d{3}-\\d{4}\\b")
NOTE: Looks like there's an issue with your data because the last 2 zip codes should be 5 characters long and not 4. 4330 should actually be 04330. If you don't have control over the data source, but know for sure that they are US codes you could pad 0's on the left as required. However since you are looking for a solution for 4 or 5 characters, you can use this:
Code to extract zip code (looks for space in front and newline at the back so that parts of a phone number or an address aren't picked)
zip <- str_extract(text, "(?<= )\\d{4,5}(?=\\n|$)")
Code to extract state code:
states <- str_extract(text, "\\b[A-Z]{2}(?=\\s+\\d{4,5}$)")
Demo: https://regex101.com/r/7Im0Mu/2
I am using address as input not the text, see if it works for your case.
Assumptions on regex: Two capital letters followed by 4 or 5 numeric letters are for state and zip, The phone numbers are always on next line.
Input:
address <- "19800 Eagle River Road, Eagle River AK 99577
907-481-1670
230 Colonial Promenade Pkwy, Alabaster AL 35007
205-620-0360
360 Connecticut Avenue, Norwalk CT 06854
860-409-0404
2080 S Lincoln, Jerome ID 83338
208-324-4333
20175 Civic Center Dr, Augusta ME 4330
207-623-8223
830 Harvest Ln, Williston VT 5495
802-878-5233
"
I am using stringr library , you may choose any other to extract the information as you wish.
library(stringr)
df <- data.frame(do.call("rbind",strsplit(str_extract_all(address,"[A-Z][A-Z]\\s\\d{4,5}\\s\\d{3}-\\d{3}-\\d{4}")[[1]],split="\\s|\\n")))
names(df) <- c("state","Zip","Phone")
EDIT:
In case someone want to use text as input,
text <- readLines(textConnection(address))
text <- data.frame(text)
st_zip <- setNames(data.frame(str_extract_all(text$text,"[A-Z][A-Z]\\s\\d{4,5}",simplify = T)),"St_zip")
pin <- setNames(data.frame(str_extract_all(text$text,"\\d{3}-\\d{3}-\\d{4}",simplify = T)),"pin")
st_zip <- st_zip[st_zip$St_zip != "",]
df1 <- setNames(data.frame(do.call("rbind",strsplit(st_zip,split=' '))),c("State","Zip"))
pin <- pin[pin$pin != "",]
df2 <- data.frame(cbind(df1,pin))
OUTPUT:
State Zip pin
1 AK 99577 907-481-1670
2 AL 35007 205-620-0360
3 CT 06854 860-409-0404
4 ID 83338 208-324-4333
5 ME 4330 207-623-8223
6 VT 5495 802-878-5233
Thank you #Rahul. Both would be great. At least can you show me how to do it with Notepad++?
Extraction using Notepad++
Well first copy your whole data in a file.
Go to Find by pressing Ctrl + F. This will open search dialog box. Choose Replace tab search with regex ([A-Z]{2}\s*\d{4,5})$ and replace with \n-\1-\n. This will search for state abbreviation and ZIP code and place them in new line with - as prefix and suffix.
Now go to Mark tab. Check Bookmark Line checkbox then search with -(.*?)- and press Mark All. This will mark state abb and ZIP which are in newlines with -.
Now go to Search --> Bookmark --> Remove Unmarked Lines
Finally search with ^-|-$ and replace with empty string.
Update
So now there will be phone numbers too ? In that case you only have to remove $ from regex in step 2. Regex to use will be ([A-Z]{2}\s*\d{4,5}). Rest all steps will be same.

Split column with multiple delimiters

I am trying to determine in R how to split a column that has multiple fields with multiple delimiters.
From an API, I get a column in a data frame called "Location". It has multiple location identifiers in it. Here is an example of one entry. (edit- I added a couple more)
6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)
4284 E 61ST ST
Kansas City, MO 64130
(39.014638172000446, -94.5335298549997)
3002 SPRUCE AVE
Kansas City, MO 64128
(39.07083265200049, -94.53320606399967)
6022 E Red Bridge Rd
Kansas City, MO 64134
(38.92458893200046, -94.52090062499968)
So the above is the entry in row 1-4, column "location".
I want split this into address, city, state, zip, long and lat columns. Some fields are separated by space or tab while others by comma. Also nothing is fixed width.
I have looked at the reshape package- but seems I need a single deliminator. I can't use space (or can I?) as the address has spaces in it.
Thoughts?
If the data you have is not like this, let everyone know by adding code we can copy and paste into R to reproduce your data (see how this sample data can be easily copied and pasted into R?)
Sample data:
location <- c(
"6540 BENNINGTON AVE
Kansas City, MO 64133
(39.005620414000475, -94.50998643299965)",
"456 POOH LANE
New York City, NY 10025
(40, -90)")
location
#[1] "6540 BENNINGTON AVE\nKansas City, MO 64133\n(39.005620414000475, -94.50998643299965)"
#[2] "456 POOH LANE\nNew York City, NY 10025\n(40, -90)"
A solution:
# Insert a comma between the state abbreviation and the zip code
step1 <- gsub("([[:alpha:]]{2}) ([[:digit:]]{5})", "\\1,\\2", location)
# get rid of parentheses
step2 <- gsub("\\(|\\)", "", step1)
# split on "\n", ",", and ", "
strsplit(step2, "\n|,|, ")
#[[1]]
#[1] "6540 BENNINGTON AVE" "Kansas City" "MO"
#[4] "64133" "39.005620414000475" "-94.50998643299965"
#[[2]]
#[1] "456 POOH LANE" "New York City" "NY" "10025"
#[5] "40" "-90"
Here is an example with the stringr package.
Using #Frank's example data from above, you can do:
library(stringr)
address <- str_match(location,
"(^[[:print:]]+)[[:space:]]([[:alpha:]. ]+), ([[:alpha:]]{2}) ([[:digit:]]{5})[[:space:]][(]([[:digit:].-]+), ([[:digit:].-]+)")
address <- data.frame(address[,-1]) # get rid of the first column which has the full match
names(address) <- c("address", "city", "state", "zip", "lat", "lon")
> address
address city state zip lat lon
1 6540 BENNINGTON AVE Kansas City MO 64133 39.005620414000475 -94.50998643299965
2 456 POOH LANE New York City NY 10025 40 -90
Note that this is pretty specific to the format of the one entry given. It would need to be tweaked if there is variation in any number of ways.
This takes everything from the start of the string to the first [:space:] character as address. The next set of letters, spaces and periods up until the next comma is given to city. After the comma and a space, the next two letters are given to state. Following a space, the next five digits make up the zip field. Finally, the next set of numbers, period and/or minus signs each get assigned to lat and lon.

Resources