Can I create a vector with regexps?


My data looks something like this:
412 U CA, Riverside
413 U British Columbia
414 CREI
415 U Pompeu Fabra
416 Office of the Comptroller of the Currency, US Department of the Treasury
417 Bureau of Economics, US Federal Trade Commission
418 U Carlos III de Madrid
419 U Brescia
420 LUISS Guido Carli
421 U Alicante
422 Harvard Society of Fellows
423 Toulouse School of Economics
424 Decision Economics Inc, Boston, MA
425 ECARES, Free U Brussels
I will need to geocode this data in order to get the coordinates for each specific institution. In order to do that, I need all state names to be spelled out. At the same time, I don't want acronyms like "ECARES" to be transformed into "ECaliforniaRES".
I have been toying with the idea of converting the state.abb and state.name vectors into vectors of regular expressions, so that state.abb would look something like this (Using Alabama and California as state 1 and state 2):
c("^AL "|" AL "|" AL,"|",AL "| " AL$", "^CA "[....])
And the state.name vector something like this:
c("^Alabama "|" Alabama "|" Alabama,"|",Alabama "| " Alabama$", "^California "[....])
Hopefully, I can then use the mgsub function to replace all expressions in the modified state.abb vector with the corresponding entries in the modified state.name vector.
For some reason, however, it doesn't seem to be possible to put regexps in a vector:
test<-c(^AL, ^AB)
Error: unexpected '^' in "test<-c(^"
I have tried escaping the "^" signs, but this doesn't really seem to work:
test<-c(\^AL, \^AB)
Error: unexpected input in "test<-c(\"
> test<-c(\\^AL, \\^AB)
Is there any way of putting regexps in a vector, or is there another way of achieving my goal (that is, to replace all two-letter state abbreviations to state names without messing up other acronyms in the process)?
Excerpt of my data:
c("U Lausanne", "Swiss Finance Institute", "U CA, Riverside",
"U British Columbia", "CREI", "U Pompeu Fabra", "Office of the Comptroller of the Currency, US Department of the Treasury",
"Bureau of Economics, US Federal Trade Commission", "U Carlos III de Madrid",
"U Brescia", "LUISS Guido Carli", "U Alicante", "Harvard Society of Fellows",
"Toulouse School of Economics", "Decision Economics Inc, Boston, MA",
"ECARES, Free U Brussels", "Baylor U", "Research Centre for Education",
"the Labour Market, Maastricht U", "U Bonn", "Swarthmore College"
)

We can take the state.abb vector and paste its elements together into a single pattern, collapsing with |:
pat1 <- paste0("\\b(", paste(state.abb, collapse="|"), ")\\b")
The \\b marks a word boundary, so that matches inside longer words (for example the "AL" in "SAL") are avoided.
Similarly, for state.name, paste ^ and $ as prefix and suffix to mark the start and end of the string respectively:
pat2 <- paste0("^(", paste(state.name, collapse="|"), ")$")
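A regular expression in R is an ordinary character string, which is why c("^AL", "^AB") parses while c(^AL, ^AB) does not. The sketch below shows one possible way to apply pat1 to a few of the sample strings. The lookup vector abb_to_name and the use of regmatches<- are assumptions made for illustration, not necessarily what the answer's author intended.

```r
# Regexes are plain strings, so they have to be quoted.
test <- c("^AL", "^AB")

# Hypothetical lookup from abbreviation to full state name.
abb_to_name <- setNames(state.name, state.abb)

# Same pattern as pat1 above: any state abbreviation as a whole word.
pat1 <- paste0("\\b(", paste(state.abb, collapse = "|"), ")\\b")

x <- c("U CA, Riverside",
       "Decision Economics Inc, Boston, MA",
       "ECARES, Free U Brussels")

# Locate every whole-word abbreviation, then swap each match for its name.
m <- gregexpr(pat1, x)
regmatches(x, m) <- lapply(regmatches(x, m),
                           function(abb) unname(abb_to_name[abb]))
x
```

Because the match is case-sensitive and bounded by \\b, "ECARES" and "Inc" are left alone. Any genuine upper-case two-letter word that happens to equal an abbreviation (for example "OR" or "IN") would still be replaced, so the result is worth checking by eye.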

Related

Adding a string between some pattern in R

I would like to add the string "AND" between the words "STREET" and "HENRY" into the following string:
WEST 156 STREET HENRY HUDSON PARKWAY
So that it reads WEST 156 STREET AND HENRY HUDSON PARKWAY. Essentially, I am trying to geocode intersections so I would like to be able to add "AND" between street types (AVENUE, STREET, BLVD, etc.) and whatever word comes after that to create the intersection like I specified above.
Here are a couple more examples (just made up):
strings = c("WEST 135TH AVE BROADWAY", # want WEST 135TH AVE AND BROADWAY,
"SUNSET BLVD MAIN ST", # SUNSET BLVD AND MAIN ST
"W 45TH ST LAKESHORE BLVD", #...
"HIGH ST BROAD ST") # ...
I would greatly appreciate any help! I am somewhat familiar with regular expressions, but I am not familiar with how to insert another word in this manner.

Capture the street type as a group and replace it with the backreference (\\1) followed by the substring "AND". The third and fourth strings are left unchanged: "ST" is not among the alternatives in the pattern, and the "BLVD" at the end of the third string is not followed by the one or more spaces that \\s+ requires.
sub("(AVENUE|AVE|STREET|BLVD)\\s+", "\\1 AND ", strings)
-output
[1] "WEST 135TH AVE AND BROADWAY" "SUNSET BLVD AND MAIN ST"
[3] "W 45TH ST LAKESHORE BLVD" "HIGH ST BROAD ST"
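To cover the last two strings as well, "ST" can be added to the alternation. A minimal sketch, assuming that the first street type in each string is the one that ends the first street name; the \\b in front is needed so that the "ST" inside "WEST" does not match:

```r
strings <- c("WEST 135TH AVE BROADWAY",
             "SUNSET BLVD MAIN ST",
             "W 45TH ST LAKESHORE BLVD",
             "HIGH ST BROAD ST")

# \\b keeps "ST" from matching inside "WEST"; sub() replaces only the
# first street type, and a street type at the very end has no trailing
# whitespace, so it is never matched.
with_and <- sub("\\b(AVENUE|AVE|STREET|ST|BLVD)\\s+", "\\1 AND ", strings)
with_and
```

The list of street types is only what appears in the question. Real address data will likely need more entries (RD, DR, PKWY and so on).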

Extracting first word after a specific expression in R

I have a column that contains thousands of descriptions like this (example) :
Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA
I'd like to create a column with the first word after "city of", like that :
Description                                  City
Building a hospital in the city of LA, USA   LA
Building a school in the city of NYC, USA    NYC
Building shops in the city of Chicago, USA   Chicago
I tried the following code after seeing the topic "Extracting string after specific word", but my column is only filled with missing values:
library(stringr)
df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))
df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))
I took a look at the dput() output, and it is the same as the descriptions I see in the data frame directly.

Solution
This should do the trick for the data you showed:
df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")
df
#> Description city
#> 1 Building a hospital in the city of LA, USA LA
#> 2 Building a school in the city of NYC, USA NYC
#> 3 Building shops in the city of Chicago, USA Chicago
Alternative
However, in case you want the whole string till the first comma (for example in case of cities with a blank in the name), you can go with:
df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")
Check out the following example:
df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
"Building a school in the city of NYC, USA",
"Building shops in the city of Chicago, USA",
"Building a church in the city of Salt Lake City, USA"))
str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA" "NYC" "Chicago" "Salt"
str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA" "NYC" "Chicago" "Salt Lake City"
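One caveat with the greedy (.+)(?=,) form: if a description contains more than one comma after the city, the match runs up to the last comma. The sketch below compares it with a negated character class that stops at the first comma. It uses base R's regexpr() with perl = TRUE instead of stringr so that it runs without extra packages, and the second description is a made-up example.

```r
x <- c("Building shops in the city of Chicago, USA",
       "Building a mall in the city of Salt Lake City, Utah, USA")

# Greedy .+ backtracks only as far as the LAST comma.
greedy <- regmatches(x, regexpr("(?<=city of ).+(?=,)", x, perl = TRUE))

# [^,]+ cannot cross a comma, so it stops at the FIRST one.
up_to_first_comma <- regmatches(x, regexpr("(?<=city of )[^,]+", x, perl = TRUE))

greedy
up_to_first_comma
```

The same "(?<=city of )[^,]+" pattern should also be usable with str_extract(), since the lookbehind has a fixed length.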
Documentation
Check out ?regex:
Patterns (?=...) and (?!...) are zero-width positive and negative
lookahead assertions: they match if an attempt to match the ...
forward from the current position would succeed (or not), but use up
no characters in the string being processed. Patterns (?<=...) and
(?<!...) are the lookbehind equivalents: they do not allow repetition
quantifiers nor \C in ....

Unable to remove the foreign languages unicode codes

I have a .csv file containing text in multiple foreign languages (Russian, Japanese, Arabic, etc.). For example, a column entry looks like this: <U+03BA><U+03BF><U+03C5>. I want to remove the rows that contain this kind of text.
I tried various solutions, all of them with no result:
test_fb5 <- read_csv('test_fb_data.csv', encoding = 'UTF-8')
or applied for a column:
gsub("[<].*[>]", "") or sub("^\\s*<U\\+\\w+>\\s*", "")
or
gsub("\\s*<U\\+\\w+>$", "")
It seems that R 4.1.0 doesn't find the respective chars. I cannot find a way to attach a small chunk of file here.
Here is the capture of the file:
address
33085 9848a 33 avenue nw t6n 1c6 edmonton ab canada alberta
33086 1075 avenue laframboise j2s 4w7 sainthyacinthe qc canada quebec
33087 <U+03BA><U+03BF><U+03C5><U+03BD><U+03BF><U+03C5>p<U+03B9>tsa 18050 spétses greece attica region
33088 390 progress ave unit 2 m1p 2z6 toronto on canada ontario
name
33085 md legals canada inc
33086 les aspirateurs jpg inc
33087 p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis
33088 wrench it up plumbing mechanical
category
33085 general practice attorneys divorce family law attorneys notaries
33086 <NA>
33087 mediterranean restaurants fish seafood restaurants
33088 plumbing services damage restoration mold remediation
phone
33085 17808512828
33086 14507781003
33087 302298072134
33088 14168005050
The numbers starting with 3308 are the row numbers of the dataset.
Thank you for your time!

You can use a negated character class to remove the <U...> codes:
gsub("<[^>]+>", "", x)
This matches any substring that:
starts with <,
is followed one or more times by any character except the > character, and
ends with >
If you have other substrings between < and > that you do not want to remove, just add U to target the Unicode codes more specifically: <U[^>]+>
Data:
x <- "address 33085 9848a 33 avenue nw t6n 1c6 edmonton ab canada alberta 33086 1075 avenue laframboise j2s 4w7 sainthyacinthe qc canada quebec 33087 <U+03BA><U+03BF><U+03C5><U+03BD><U+03BF><U+03C5>p<U+03B9>tsa 18050 spétses greece attica region 33088 390 progress ave unit 2 m1p 2z6 toronto on canada ontario name 33085 md legals canada inc 33086 les aspirateurs jpg inc 33087 p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis 33088 wrench it up plumbing mechanical category 33085 general practice attorneys divorce family law attorneys notaries 33086 <NA> 33087 mediterranean restaurants fish seafood restaurants 33088 plumbing services damage restoration mold remediation phone 33085 17808512828 33086 14507781003 33087 302298072134 33088 14168005050"
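The question asks for removing whole rows, while the pattern above removes only the codes. A minimal sketch of both, assuming the <U+....> sequences are literally present as text in the strings (if they only appear when R prints non-ASCII characters, no regex on "<U" will match). The data frame df and the name code_pat are made up for illustration.

```r
df <- data.frame(
  name = c("md legals canada inc",
           "p<U+03AC>t<U+03C1>a<U+03BB><U+03B7><U+03C2>patralis",
           "wrench it up plumbing mechanical"),
  category = c("general practice attorneys", "<NA>", "plumbing services"),
  stringsAsFactors = FALSE
)

# "<U+" followed by hex digits and ">": narrower than <[^>]+>, so a
# literal "<NA>" is left alone.
code_pat <- "<U\\+[0-9A-Fa-f]+>"

# Option 1: strip the codes but keep the row.
stripped <- gsub(code_pat, "", df$name)

# Option 2: drop every row in which any column contains a code.
has_code <- Reduce(`|`, lapply(df, grepl, pattern = code_pat))
kept <- df[!has_code, , drop = FALSE]
kept
```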

Extracting cities from string vector in R

I have a column in my dataset db, say db$affiliation, which looks like:
**db$affiliation**
[1] "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA"
[2] "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS."
[3] "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
[4] ...
I would like to create a column within the same dataset containing only the name of the city in db$affiliation, such as
**db$cities**
[1] LOS ANGELES
[2] TWENTE
[3] BANGKOK
[4] ...
If multiple city names are available, I'd like the command to return only the last one, if no city names are available I'd like to have NA. How can I do that?
I thought that I could use world.cities$name in data(world.cities) in the maps package but I can not figure out how.
I even tried to split the db$affiliation column such as:
db$affiliation <- gsub("\\[[^\\]]*\\]", "", db$affiliation, perl=TRUE) # remove content within brackets
db$affiliation[2] # check the separator
db <- cSplit(db, 'affiliation', sep=c(", "), type.convert=FALSE) # split after comma
Which results (I've truncated it after affiliation_3) in:
affiliation_1 affiliation_2 affiliation_3
[1] UNIV CALIF LOS ANGELES DEPT GEOG LOS ANGELES
[2] UNIV TWENTE DEPT WATER ENGN & MANAGEMENT DRIENERLOLAAN
[3] CHULALONGKORN UNIV FAC ARCHITECTURE BANGKOK
And then pass:
db$cities <- lapply(db$affiliation_1, function(x)x[which(x %in% world.cities$name)])
But I get an empty column.
Thanks for the help!

Many city names match in your sample strings, so you may want to reconsider whether you still want to fetch only the 'last city' when multiple cities are found in the affiliation column.
library(maps)
data(world.cities)
#sample data
df <- data.frame(affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
"Prem"), stringsAsFactors = F)
#fetch city and its respective country from 'affiliation' column
cities_country <- lapply(gsub("\\[|\\]|[,;]|\\.","",df$affiliation), function(x)
paste(as.character(world.cities$name[sapply(world.cities$name, grepl, x, ignore.case=T)]),
as.character(world.cities$country.etc[sapply(world.cities$name, grepl, x, ignore.case=T)]),
sep="_"))
df$cities_country <- lapply(cities_country, function(x) if(identical(x, character(0))) NA_character_ else x)
df
Output is:
affiliation
1 [SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA
2 [VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.
3 [ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.
4 Prem
cities_country
1 Al_Norway, Alle_Switzerland, Allen_Philippines, Allen_USA, Angeles_Costa Rica, Angeles_Philippines, Cali_Colombia, Cot_Costa Rica, Li_Norway, Los Angeles_Chile, Los Angeles_USA, Os_Kyrgyzstan, Os_Norway, U_Micronesia, Usa_Japan
2 Ae_Marshall Islands, Ede_Netherlands, Ede_Nigeria, Enschede_Netherlands, Hede_China, Ine_Marshall Islands, Laa_Austria, Lola_Guinea, Man_Ivory Coast, Mana_French Guiana, Manage_Belgium, Nagem_Luxembourg, Ob_Russia, Ola_Panama, Po_Burkina Faso, U_Micronesia, Van_Turkey, Wa_Ghana, We_New Caledonia
3 Aila_Estonia, Al_Norway, Anan_Japan, Ba_Fiji, Bangkok_Thailand, Hit_Iraq, Ila_Nigeria, Ilan_Taiwan, Long_Thailand, Nan_Thailand, Tsu_Japan, U_Micronesia, Ula_Turkey
4 NA
(Note that in above output I have kept all occurrences of cities and for convenience also suffixed it with their respective countries)
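Most of the short hits in that output (Al, U, Os and so on) come from matching city names as substrings. One way to cut them down is to require word boundaries around each name. The sketch below uses a tiny hand-written cities vector in place of world.cities$name so that it runs without the maps package. City names containing regex metacharacters would need escaping first.

```r
# Stand-in for world.cities$name (an assumption for this example).
cities <- c("Al", "Los Angeles", "Angeles", "U")

x <- "UNIV CALIF LOS ANGELES DEPT GEOG LOS ANGELES CA 90095 USA"

# Substring matching: "Al" is found inside "CALIF", "U" inside "UNIV".
loose <- cities[vapply(cities, grepl, logical(1), x = x,
                       ignore.case = TRUE, USE.NAMES = FALSE)]

# Whole-word matching: wrap each name in \\b ... \\b.
strict <- cities[vapply(cities, function(city) {
  grepl(paste0("\\b", city, "\\b"), x, ignore.case = TRUE)
}, logical(1), USE.NAMES = FALSE)]

loose
strict
```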

From the few lines you have shown, it looks like you might be able to do the following (note that your attempt did not match the letter case of the city names):
tmpVec <- sapply(strsplit(db$affiliation, split = ","), function(x) {
cleanVec <- toupper(trimws(x))
cleanVec[max(which(cleanVec %in% toupper(maps::world.cities$name)))]
})
Or put a bit more code into the function to avoid the ugly warnings that max() gives when no city is found.
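A sketch of what "a bit more code" could look like: check whether anything matched before calling max(), and return NA otherwise. The function name last_city and the short known vector (standing in for maps::world.cities$name) are assumptions for illustration.

```r
# Return the last comma-separated field that is a known city, or NA.
last_city <- function(affiliation, known_cities) {
  parts <- toupper(trimws(strsplit(affiliation, ",", fixed = TRUE)[[1]]))
  hits <- which(parts %in% toupper(known_cities))
  if (length(hits) == 0) NA_character_ else parts[max(hits)]
}

# Stand-in for maps::world.cities$name.
known <- c("Los Angeles", "Bangkok", "Enschede")

aff <- c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
         "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
         "Prem")

found <- vapply(aff, last_city, character(1),
                known_cities = known, USE.NAMES = FALSE)
found
```

This only finds a city when it fills a whole comma-separated field, so "NL-7500 AE ENSCHEDE" from the second sample row would still come back as NA.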

Let me leave part of a solution. As far as I can tell from my own research, the text in the square brackets seems to indicate personal names. For example, I found that Sutee Anantsuksomsri is an actual name. This observation suggests that we probably want to remove the text in the brackets.
Once I removed the text in the square brackets, I split the strings into words using unnest_tokens() in the tidytext package. Note that the function converts all letters to lowercase; if you do not want that, specify to_lower = FALSE. First, I split each city name into words and assigned an ID number to each city. Second, I cleaned up your data: as I said earlier, I removed the text in square brackets using gsub(), and then applied unnest_tokens() to the data. Finally, I kept only the words that also occur in the city names, using filter(). The result up to this point is shown below. Obviously, you have more work to do. I leave the sample data, mydf, below. I hope you can move on from here.
library(maps)      # world.cities
library(dplyr)     # %>%, mutate(), filter()
library(tidytext)  # unnest_tokens()
data(world.cities)
cities <- world.cities %>%
  mutate(id = 1:n()) %>%
  unnest_tokens(input = name, output = word, token = "words")
temp <- mydf %>%
  mutate(affiliation = gsub(x = affiliation, pattern = "\\[.*\\]", replacement = "")) %>%
  unnest_tokens(input = affiliation, output = word, token = "words") %>%
  filter(word %in% cities$word)
id word
1 1 los
2 1 angeles
3 1 los
4 1 angeles
5 1 ca
6 1 usa
7 2 water
8 2 ae
9 2 enschede
10 3 bangkok
DATA
mydf <- structure(list(id = 1:3, affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
)), .Names = c("id", "affiliation"), row.names = c(NA, -3L), class = "data.frame")

Remove specific string at the end position of each row from dataframe(csv)

I am trying to clean a set of data which is in CSV format. After loading the data into R, I need to replace and also remove some characters from it. Below is an example. Ideally I want to
replace the St at the end of each row -> Street
in cases where there is St St.,
I need to remove St and replace St. with just Street.
I tried to use this code
sub(x = evostreet, pattern = "St.", replacement = " ") and later
gsub(x = evostreet, pattern = "St.", replacement = " ") to remove the St. at the end of each row, but this also removes some other occurrences of St and the character that follows
3 James St.
4 Glover Road St.
5 Jubilee Estate. St.
7 Fed Housing Estate St.
8 River State School St.
9 Brown State Veterinary Clinic. St.
11 Saw Mill St.
12 Dyke St St.
13 Governor Rd St.

I'm seeing a lot of close answers but I'm not seeing any that address the second problem he's having such as replacing "St St." with "Street"; e.g., "Dyke St St."
sub, as stated in the documentation:
The two *sub functions differ only in that sub replaces only the first occurrence of a pattern
So, just using "St\\." as the pattern match is incorrect.
OP needs to match a possible pattern of "St St." and I'll further assume that it could even be "St. St." or "St. St".
Assuming OP is using a simple list:
x = c("James St.", "Glover Road St.", "Jubilee Estate. St.",
"Fed Housing Estate St.", "River State School St St.",
"Brown State Vet Clinic. St. St.", "Dyke St St.")
[1] "James St." "Glover Road St."
[3] "Jubilee Estate. St." "Fed Housing Estate St."
[5] "River State School St St." "Brown State Vet Clinic. St. St."
[7] "Dyke St St."
Then the following will replace the possible combinations mentioned above with "Street", as requested:
y <- sub(x, pattern = "[ St\\.]*$", replacement = " Street")
[1] "James Street" "Glover Road Street"
[3] "Jubilee Estate Street" "Fed Housing Estate Street"
[5] "River State School Street" "Brown State Vet Clinic Street"
[7] "Dyke Street"
Edit:
To answer OP's question below about replacing one St. with Saint and another with Street: I was looking for a way to match similar expressions and return different values, but so far I haven't found one. I suspect regmatches can do this, but it's something I'll have to fiddle with later.
Here is a simple way to accomplish what you want. Let's assume:
x <- c("St. Mary St St.", "River State School St St.", "Dyke St. St")
[1] "St. Mary St St." "River State School St St."
[3] "Dyke St. St"
So you want x[1] to be Saint Mary Street, x[2] to be River State School Street and x[3] to be Dyke Street. I would resolve the Saint issue first by assigning the result of sub() to y:
y <- sub(x, pattern = "^St\\.", replacement = "Saint")
[1] "Saint Mary St St." "River State School St St."
[3] "Dyke St. St"
To resolve the St's at the end, we can use the same pattern as I posted above, except that the input vector is now the y I just made instead of x:
y <- sub(y, pattern = "[ St\\.]*$", replacement = " Street")
[1] "Saint Mary Street" "River State School Street"
[3] "Dyke Street"
And that should take care of it. Now, I don't know if this is the most efficient way, and if your dataset is rather large this may run slowly. If I find a better solution I will post it (provided no one else beats me to it).
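One thing to watch with "[ St\\.]*$": the square brackets form a character class, so the pattern strips any trailing run of the characters space, S, t and dot, not the token "St." as such. A street name ending in "t" would lose letters. The sketch below shows the effect and a stricter alternative built from a repeated group. The name "Market St." and the stricter pattern are illustrative assumptions, not part of the original answer.

```r
x <- c("Market St.", "Dyke St St.", "Brown State Vet Clinic. St. St.",
       "Dyke St. St")

# Character class: also eats the final "t" of "Market".
loose <- sub("[ St\\.]*$", " Street", x)

# Repeated group: an optional stray dot, then one or more "St" / "St."
# tokens, each preceded by whitespace, anchored at the end of the string.
strict <- sub("\\.?(\\s+St\\.?)+$", " Street", x, perl = TRUE)

loose
strict
```

A leading "St." for Saint can still be handled beforehand with sub("^St\\.", "Saint", x), as in the edit above.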

You don't need to use regular expression here.
sub(x = evostreet, pattern = "St.", replacement = " ", fixed=T)
The fixed argument means that the pattern is treated as an exact string to be matched literally, not as a regular expression.

I think that your problem is that the '.' character in a regular expression means "any single character". So, to match a literal dot in R, you should write
sub(x = evostreet, pattern = "St\\.", replacement = " ")
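A small comparison of the three variants discussed in these answers, using one street name from the question. The empty replacement is used here only to make the difference easy to see.

```r
street <- "River State School St."

# Unescaped dot: "St." matches "Sta" inside "State" first.
unescaped <- sub("St.", "", street)

# Escaped dot: only a literal "St." matches.
escaped <- sub("St\\.", "", street)

# fixed = TRUE: the pattern is a plain string, no escaping needed.
literal <- sub("St.", "", street, fixed = TRUE)

unescaped
escaped
literal
```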

You will need to escape the dot; otherwise it matches any character after St, and that is why other parts of your text are removed.
sub(x = evostreet, pattern = "St\\.", replacement = " ")
You can add $ at the end if you want to remove the tag only when it appears at the end of the text.
sub(x = evostreet, pattern = "St\\.$", replacement = " ")
The difference between sub and gsub is that sub handles only the first occurrence of your tag in a text, while gsub replaces all occurrences. In your case, since the pattern is anchored to the end of the line with $, it should not make any difference which one you use.
