pattern matching with sub(), unable to catch and replace first occurrence - r

The following are the results I expect:
> title = "La La Land (2016/I)"
[1]"(2016" #result
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
[1]"(2013" #result
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
[1]"(2015" #result
==================================================================
The following is what I got by applying sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1"):
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(1500-1800) (#1.1)" #result. However, I expected it to be "(2013)"
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2016/I)" #result as I expect
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1]"(2017)" # result. However, I expect it to be "(2015)"
The following is what I got by applying sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1"):
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "La La Land (2016/I)" #result. However, I expect it to be "(2016)"
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2013)" #result as I expect
I checked the description of sub; it says "sub performs replacement of the first match". In this case, the first match should be (2013).
In short, I am trying to write a sub() command that returns the first occurrence of a year in a string.
I guess there is something wrong with my code but I couldn't find it. I would appreciate it if anyone could help me.
==================================================================
In fact, my ultimate goal is to extract the year of every movie. However, I don't know how to do it in one step, so I decided to first find the year in "(dddd" format and then use sub(pattern="\\((\\d{4}).*", a, replacement="\\1") to get just the number of the year.
For example:
> a= "(2015"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
> a= "(2015)"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
=================updated 05/29/2017 22:51PM=======================
the str_extract in akrun's answer works well with my dataset.
However, sub() doesn't work for all of my data. The following is what I did; the code fails on some of my 500 records. I would really appreciate it if anyone could point out the mistakes in my code, because I cannot figure it out myself. Thank you very much.
> t1
[1] "Man Who Fell to Earth (Remix) (2010) (TV)"
> t2
[1] "Manual pr\u0087ctico del amigo imaginario (abreviado) (2008)"
> title = c(t1,t2)
> x=gsub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
> x
[1] "(2010) (TV)" "(2008)"
> sub(pattern="\\((.*)\\).*", x, replacement="\\1")
[1] "2010) (TV" "2008"
However, my goal is to get 2010 and 2008. My code works with t2 but fails with t1

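The leading .* in your pattern is greedy, so the capture group ends up on the last match it can reach rather than the first. Instead, we can match zero or more characters that are not a ( ([^(]*) from the start (^) of the string, followed by a ( and four digits (\\([0-9]{4}), which we capture as a group ((...)), followed by any other characters (.*), and replace the whole match with the backreference (\\1) to the captured group: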
sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
#[1] "(2016" "(2013" "(2015"
If we need to remove the (, then capture only the digits that follow the \\( as a group:
sub("^[^(]*\\(([0-9]{4}).*", "\\1", title)
#[1] "2016" "2013" "2015"
Or with str_extract, we use a regex lookbehind to extract the four digits that follow the (:
library(stringr)
str_extract(title, "(?<=\\()[0-9]{4}")
#[1] "2016" "2013" "2015"
Or with regmatches/regexpr
regmatches(title, regexpr("(?<=\\()([0-9]{4})", title, perl = TRUE))
#[1] "2016" "2013" "2015"
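The patterns above assume that the first ( in the string is the one in front of the year. The record from the question's update, "Man Who Fell to Earth (Remix) (2010) (TV)", breaks that assumption. As a hedged sketch (the names titles, first_paren and lazy are made up for this illustration), a lazy quantifier .*? with perl = TRUE skips ahead to the first ( that is actually followed by four digits:

```r
# Two titles from the question plus the record from its update, which has a
# non-year parenthesis before the year.
titles <- c("La La Land (2016/I)",
            "dfajfj(2015)asdfjuwer f(2017)fa.erewr6",
            "Man Who Fell to Earth (Remix) (2010) (TV)")

# [^(]* stops at the very first "(", so the pattern only matches when that
# parenthesis is followed by four digits; otherwise sub() returns the input unchanged.
first_paren <- sub("^[^(]*\\(([0-9]{4}).*", "\\1", titles)

# .*? consumes as little as possible, so the group lands on the first "("
# that is followed by four digits, wherever it is.
lazy <- sub("^.*?\\(([0-9]{4}).*", "\\1", titles, perl = TRUE)

first_paren
lazy
```

Here first_paren leaves the third title untouched, while lazy returns "2016" "2015" "2010". The str_extract and regmatches/regexpr versions above also handle this record, because the lookbehind is tried at every ( in turn.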
data
title <- c("La La Land (2016/I)",
"_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_",
"dfajfj(2015)asdfjuwer f(2017)fa.erewr6")

Related

How can I parse the text from one countryName to another countryName in R?

I'm just having a really hard time figuring this out. Let's go straight to the data.
library(countrycode)
countries <- codelist$country.name.en #list of countries from the library
text <- "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. (Spain) No information available. (Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
I want to create a list of the parsed text (e.g. from "(France)" to "Nissan.") for all three countries. The actual text is 30 pages long and each (countryName) is followed by several paragraphs of text.
All the countryNames are in parentheses but there might be other non-country parentheses in the text or countryNames without parentheses. But the general pattern is that each segment I want to parse starts with (countryName1) and ends with (countryName2)
Output:
[[1]]
[1] "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan."
[[2]]
[1] "(Spain) No information available."
[[3]]
[1] "(Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
If all the countries in 'text' match the reference vector, we can paste the reference vector into a single alternation pattern and split the string just before each country match:
as.list(strsplit(text, sprintf('(?<=\\s)(?=(%s))',
paste(paste0("\\(", countries), collapse = "|")), perl = TRUE)[[1]])
-output
[[1]]
[1] "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. "
[[2]]
[1] "(Spain) No information available. "
[[3]]
[1] "(Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
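The split above leaves a trailing space on each piece and matches any text that merely starts with "(" plus a country name. A minimal variation, assuming a small hypothetical countries vector instead of codelist$country.name.en, also requires the closing parenthesis and trims the pieces. Names containing regex metacharacters would need escaping first, which this sketch does not do.

```r
# Hypothetical subset of country names; the real vector would come from countrycode.
countries <- c("France", "Spain", "Chad")
text <- "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. (Spain) No information available. (Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."

# Zero-width split point: preceded by whitespace, followed by "(Country)".
pattern <- sprintf("(?<=\\s)(?=\\((?:%s)\\))", paste(countries, collapse = "|"))

# Split once, drop the trailing spaces, and return a list with one element per country.
parts <- as.list(trimws(strsplit(text, pattern, perl = TRUE)[[1]]))
parts
```

Because the lookahead demands "(Country)" in full, "(CEO)", "(from N'Djamena)" and the bare word France do not trigger a split.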

R: Replace Abbreviations\ Words

I have tried to resolve this problem all day without any progress.
I am trying to replace the following abbreviations with the following desired words in my dataset:
-Abbreviations: USA, H2O, Type 3, T3, bp
-Desired words: United States of America, Water, Type 3 Disease, Type 3 Disease, blood pressure
The input data is for example
[1] I have type 3, its considered the highest severe stage of the disease.
[2] Drinking more H2O will make your skin glow.
[3] Do I have T2 or T3? Please someone help.
[4] We don't have this on the USA but I've heard that will be available in the next 3 years.
[5] Having a high bp means that I will have to look after my diet?
The desired output is
[1] i have type 3 disease, its considered the highest severe stage of the disease.
[2] drinking more water will make your skin glow.
[3] do I have type 3 disease? please someone help.
[4] we don't have this in the united states of america but i've heard that will be available in the next 3 years.
[5] having a high blood pressure means that I will have to look after my diet?
I have tried the following code but without success:
data= read.csv(C:"xxxxxxx, header= TRUE")
lowercase= tolower(data$MESSAGE)
dict=list("\\busa\\b"= "united states of america", "\\bh2o\\b"=
"water", "\\btype 3\\b|\\bt3\\"= "type 3 disease", "\\bbp\\b"=
"blood pressure")
for(i in 1:length(dict1)){
lowercasea= gsub(paste0("\\b", names(dict)[i], "\\b"),
dict[[i]], lowercase)}
I know that I am definitely doing something wrong. Could anyone guide me on this? Thank you in advance.
If you need to replace only whole words (e.g. bp in Some bp. but not in bpcatalogue), build a single regular expression from the abbreviations with word boundaries on both sides. Because some abbreviations span several words, also sort them by length in descending order; otherwise a shorter alternative (e.g. type) could trigger a replacement before a longer one (e.g. type three).
An example code:
abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
desired_words <- c("United States of America", "Water", "Type 3 Disease", "Type 3 Disease", "blood pressure")
df <- data.frame(abbreviations, desired_words, stringsAsFactors = FALSE)
x <- 'Abbreviations: USA, H2O, Type 3, T3, bp'
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
library(stringr)
str_replace_all(x,
paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b"),
function(z) df$desired_words[df$abbreviations==z][[1]][1]
)
The paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b") code creates a regex like \b(Type 3|USA|H2O|T3|bp)\b, it matches Type 3, or USA, etc. as whole word only as \b is a word boundary. If a match is found, stringr::str_replace_all replaces it with the corresponding desired_word.
See the R demo online.
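For the lowercased output the question asks for, the same idea can be sketched in base R. This is only an illustration under stated assumptions: messages and out are hypothetical names, the sample strings are copied from the question, and the replacements are applied one at a time, longest abbreviation first, so that "type 3" is handled before "t3" can introduce a new "type 3".

```r
abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
desired_words <- c("united states of america", "water", "type 3 disease",
                   "type 3 disease", "blood pressure")

# Hypothetical sample input, taken from the question.
messages <- c("I have type 3, its considered the highest severe stage of the disease.",
              "Drinking more H2O will make your skin glow.",
              "Do I have T2 or T3? Please someone help.",
              "Having a high bp means that I will have to look after my diet?")

out <- tolower(messages)
# Longest abbreviation first; \\b restricts each replacement to whole words.
for (i in order(-nchar(abbreviations))) {
  pattern <- paste0("\\b", tolower(abbreviations[i]), "\\b")
  out <- gsub(pattern, desired_words[i], out, perl = TRUE)
}
out
```

Unlike the single-pass str_replace_all call, a loop like this can re-match text inserted by an earlier replacement, so the processing order matters.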

How to use hunspell package to suggest correct words in a column in R?

I'm currently working with a large data frame containing lots of text in each row and would like to effectively identify and replace misspelled words in each sentence with the hunspell package. I was able to identify the misspelled words, but can't figure out how to do hunspell_suggest on a list.
Here is an example of the data frame:
df1 <- data.frame("Index" = 1:7, "Text" = c("A complec sentence joins an independet",
"Mary and Samantha arived at the bus staton before noon",
"I did not see thm at the station in the mrning",
"The participnts read 60 sentences in radom order",
"how to fix mispelled words in R languge",
"today is Tuesday",
"bing sports quiz"))
I converted the text column into character and used hunspell to identify the misspelled words within each row.
library(hunspell)
df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)
I tried
df1$suggest <- hunspell_suggest(df1$word_check)
but it keeps giving this error:
Error in hunspell_suggest(df1$word_check) :
is.character(words) is not TRUE
I'm new to this, so I'm not exactly sure how the suggest column produced by hunspell_suggest would turn out. Any help will be greatly appreciated.
Check your intermediate steps. The output of df1$word_check is as follows:
List of 5
$ : chr [1:2] "complec" "independet"
$ : chr [1:2] "arived" "staton"
$ : chr [1:2] "thm" "mrning"
$ : chr [1:2] "participnts" "radom"
$ : chr [1:2] "mispelled" "languge"
which is of type list. If you did lapply(df1$word_check, hunspell_suggest) you can get the suggestions.
EDIT
I decided to go into more detail on this question as I have not seen any easy alternative. This is what I have come up with:
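cleantext = function(x){
  # process each string in turn
  sapply(1:length(x), function(y){
    bad = hunspell(x[y])[[1]]                              # misspelled words in this string
    good = unlist(lapply(hunspell_suggest(bad), `[[`, 1))  # first suggestion for each word
    if (length(bad)){
      for (i in 1:length(bad)){
        x[y] <<- gsub(bad[i], good[i], x[y])               # update x in cleantext's environment
      }}})
  x
}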
Although there probably is a more elegant way of doing it, this function returns a vector of character strings corrected as such:
> df1$Text
[1] "A complec sentence joins an independet"
[2] "Mary and Samantha arived at the bus staton before noon"
[3] "I did not see thm at the station in the mrning"
[4] "The participnts read 60 sentences in radom order"
[5] "how to fix mispelled words in R languge"
[6] "today is Tuesday"
[7] "bing sports quiz"
> cleantext(df1$Text)
[1] "A complex sentence joins an independent"
[2] "Mary and Samantha rived at the bus station before noon"
[3] "I did not see them at the station in the morning"
[4] "The participants read 60 sentences in radon order"
[5] "how to fix misspelled words in R language"
[6] "today is Tuesday"
[7] "bung sports quiz"
Watch out, as this returns the first suggestion given by hunspell - which may or may not be correct.

Turn Street Address Into Components

I have address data I extracted from SQL, and have now loaded into R. I am trying to extract out the individual components, namely the ZIP-CODE at the end of the query (State would also be nice). I would like the ZIP-CODE and State to be in new individual columns.
The primary issue is the ZIP-CODE is sometimes 5 digits, and sometimes 9.
Two example rows would be:
Address_FULL
1234 NOWHERE ST WASHINGTON DC 20005
567 EVERYWHERE LN CHARLOTTE NC 22011-1203
I suspect I'll need some kind of regex \\d{5} notation, or some kind of fancy manipulation in dplyr that I'm not aware exists.
If the zip code is always at the end you could use
str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$")
To add a "zip" column via dplyr you could use
df %>% mutate(zip = str_extract(Address_FULL,"[[:digit:]]{5}(-[[:digit:]]{4})?$"))
Where df is your data frame containing Address_FULL, and str_extract() is from stringr.
State could be extracted as follows:
str_extract(Address_FULL,"(?<=\\s)[[:alpha:]]{2}(?=\\s[[:digit:]]{5})")
However, this makes the following assumptions:
The state abbreviation is 2 characters long
The state abbreviation is followed immediately by a space
The zip code follows immediately after the space that follows the state
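The same two patterns can be sketched in base R, without stringr or dplyr. The helper extract_first is a hypothetical name introduced here; it mimics str_extract by returning NA where nothing matches. The state pattern below additionally assumes the ZIP code ends the string.

```r
df <- data.frame(
  Address_FULL = c("1234 NOWHERE ST WASHINGTON DC 20005",
                   "567 EVERYWHERE LN CHARLOTTE NC 22011-1203"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first match of `pattern` in each element, NA if none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

# ZIP: five digits, optionally followed by "-dddd", at the end of the string.
df$zip <- extract_first(df$Address_FULL, "[0-9]{5}(-[0-9]{4})?$")
# State: two letters after whitespace and right before the trailing ZIP.
df$state <- extract_first(df$Address_FULL,
                          "(?<=\\s)[A-Za-z]{2}(?=\\s[0-9]{5}(-[0-9]{4})?$)")
df
```

Anchoring the state pattern to the trailing ZIP keeps two-letter tokens such as ST or LN earlier in the address from being picked up.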
Assuming that the zip is always at the end, you can try:
tail(unlist(strsplit(STRING, split=" ")), 1)
For example
ex1 = "1234 NOWHERE ST WASHINGTON DC 20005"
ex2 = "567 EVERYWHERE LN CHARLOTTE NC 22011-1203"
> tail(unlist(strsplit(ex1, split=" ")), 1)
[1] "20005"
> tail(unlist(strsplit(ex2, split=" ")), 1)
[1] "22011-1203"
Use my package tfwstring
Works automatically on any address type, even with prefixes and suffixes.
if (!require(remotes)) install.packages("remotes")
remotes::install_github("nbarsch/tfwstring")
parseaddress("1234 NOWHERE ST WASHINGTON DC 20005", force_stateabb = F)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"1234" "NOWHERE" "ST" "WASHINGTON" "DC" "20005"
parseaddress("567 EVERYWHERE LN CHARLOTTE NC 22011-1203", force_stateabb = F)
AddressNumber StreetName StreetNamePostType PlaceName StateName ZipCode
"567" "EVERYWHERE" "LN" "CHARLOTTE" "NC" "22011-1203"

Structure character data into data frame

I used the rvest package in R to scrape some web data, but I am having a lot of trouble getting it into a usable format.
My data currently looks like this:
test
[1] "v. Philadelphia"
[2] "TD GardenRegular Season"
[3] "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving"
[4] "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons"
[5] "100.7 - 83.4"
[6] "# Toronto"
[7] "Air Canada Centre Regular Season"
[8] "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford"
[9] "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet"
[10] "115.6 - 103.3"
Can someone help me with the operations (and the code) needed to turn it into a data frame that looks like this? I would really appreciate it:
Opponent      Venue
Philadelphia  TD Garden
Toronto       Air Canada Centre
I do not need any of the other information.
Let me know if there are any issues :)
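# put your data in here
input <- c("v. Philadelphia", "TD GardenRegular Season",
           "", "", "",
           "# Toronto", "Air Canada Centre Regular Season",
           "", "", "")
index <- seq_along(input)
# raw table format: each game takes 5 elements; the 1st is the opponent, the 2nd the venue
out_raw <- data.frame(Opponent = input[index %% 5 == 1],
                      Venue = input[index %% 5 == 2],
                      stringsAsFactors = FALSE)
# using stringi package
# install.packages("stringi")  # run once if stringi is not installed
library(stringi)
# copy and clean up
out_clean <- out_raw
# keep everything after the first whitespace, e.g. "v. Philadelphia" -> "Philadelphia"
out_clean$Opponent <- stri_extract_last_regex(out_raw$Opponent, "(?<=\\s).*$")
# drop the "Regular Season" label and any leftover surrounding spaces
out_clean$Venue <- trimws(gsub("Regular Season", "", out_raw$Venue))
out_clean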
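A base R variant of the same idea, as a rough sketch: it assumes, as the answer above does, that every game occupies exactly five consecutive elements, and it reshapes the vector into a matrix so that the first two columns hold the opponent and the venue. The names rows and out are made up for this example.

```r
# The scraped vector from the question.
test <- c("v. Philadelphia", "TD GardenRegular Season",
          "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving",
          "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons",
          "100.7 - 83.4",
          "# Toronto", "Air Canada Centre Regular Season",
          "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford",
          "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet",
          "115.6 - 103.3")

# One row per game, five columns per row (assumption: the layout never varies).
rows <- matrix(test, ncol = 5, byrow = TRUE)

out <- data.frame(
  # drop the leading marker ("v." or "#") and the whitespace after it
  Opponent = sub("^\\S+\\s+", "", rows[, 1]),
  # drop the trailing "Regular Season" label, then any leftover space
  Venue = trimws(sub("Regular Season$", "", rows[, 2])),
  stringsAsFactors = FALSE
)
out
```

If the scraped vector ever contains games with a different number of elements, the matrix reshape will misalign, so checking length(test) %% 5 == 0 first is a reasonable guard.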
