R grepl not giving desired result loading CSV

I don't know what I could be overlooking here, but I am importing a CSV file with a bunch of names into a data.frame. When I pull a value from the data frame and run grepl against it, there is no match. If I take that same value and manually create a string, it matches fine. Any help would be appreciated.
I obviously can't give you the CSV or the data source, so I have tried to include all the code below.
After a further look, it seems the string no longer contains a space:
> Parks[1,2]
[1] "Abraham Lincoln Birthplace National Historical Park"
> typeof(Parks[1,2])
[1] "character"
> grepl(" ", Parks[1,2], fixed = TRUE)
[1] FALSE
> grepl("National Historical Park", Parks[1,2])
[1] FALSE
> grepl("National", Parks[1,2], fixed = TRUE)
[1] TRUE
> grepl("National Historical Park", "Abraham Lincoln Birthplace National Historical Park")
[1] TRUE
> grepl(" ", "Abraham Lincoln Birthplace National Historical Park")
[1] TRUE

The apparent blank spaces were not ASCII spaces but unicode \u2022 characters. Replacing every character outside printable ASCII with a plain space before calling grepl gives the desired result.
> Code <- Parks[1,2]
> Code <- gsub('[^\x20-\x7E]', ' ', Code)
> grepl(" ", Code, fixed = TRUE)
[1] TRUE
> grepl("National Historical Park", Code)
[1] TRUE
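A minimal sketch of how such invisible characters can be diagnosed before cleaning. The value of park is a hypothetical stand-in for Parks[1,2], and it assumes the separators are non-breaking spaces (U+00A0) purely for illustration; the same steps apply to any non-ASCII character.

```r
# Hypothetical stand-in for Parks[1, 2]: the words are separated by
# non-breaking spaces (U+00A0), which print like ordinary spaces.
park <- "Abraham\u00a0Lincoln\u00a0Birthplace"

# Inspect the code points: an ordinary space is 32, anything above 126
# is outside printable ASCII and worth a closer look.
code_points <- utf8ToInt(park)
suspicious <- unique(code_points[code_points > 126])
print(suspicious)

# Replace everything outside printable ASCII with a plain space,
# then search the cleaned copy (not the original value).
cleaned <- gsub("[^\x20-\x7E]", " ", park)
print(grepl(" ", cleaned, fixed = TRUE))
```

Note that this replacement also discards legitimate non-ASCII letters (accents, for example), so it may be too aggressive for names that contain them.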

Related

Mixed kana and kanji romanization to romaji in R

I have a large character vector of Japanese words (mixed kanji and kana) which needs to be romanized (to romaji).
However, with the available functions (zipangu::str_conv_romanhira() and audubon::strj_romanize()), I am not getting the desired results.
For example, for 北海道 (Hokkaido), zipangu::str_conv_romanhira() converts it to Chinese pinyin and audubon::strj_romanize() converts only kana characters.
How can I convert such mixed kana and kanji text to romaji?
library(zipangu)
library(stringi)
library(audubon)
str_conv_romanhira("北海道", "roman")
#> [1] "běi hǎi dào"
stri_trans_general("北海道", "Any-Latin")
#> [1] "běi hǎi dào"
strj_romanize("北海道")
#> [1] ""
There aren't any R packages that provide transliteration of Japanese kanji to romaji that I can see (at least none that are currently on CRAN). It's easy enough, however, to use the python module pykakasi via R to achieve this:
library(reticulate)
py_install("pykakasi") # Only need to install once
# Make module available in R
pykakasi <- import("pykakasi")
# Alias the convert function for convenience
convert <- pykakasi$kakasi()$convert
convert("北海道")
[[1]]
[[1]]$orig
[1] "北海道"
[[1]]$hira
[1] "ほっかいどう"
[[1]]$kana
[1] "ホッカイドウ"
[[1]]$hepburn
[1] "hokkaidou"
[[1]]$kunrei
[1] "hokkaidou"
[[1]]$passport
[1] "hokkaidou"
# Function to extract romaji and collapse
to_romaji <- function(txt) {
  paste(sapply(convert(txt), `[[`, "hepburn"), collapse = " ")
}
# Test on some longer text
lapply(c("北海道", "石の上にも三年", "豚に真珠"), to_romaji)
[[1]]
[1] "hokkaidou"
[[2]]
[1] "ishi no ueni mo sannen"
[[3]]
[1] "buta ni shinju"

Extract certain words from dynamic strings vector

I'm working with questionnaire datasets where I need to extract some brands' names from several questions. The problem is each data might have a different question line, for example:
Data #1
What do you know about AlphaToy?
Data #2
What comes to your mind when you heard AlphaCars?
Data #3
What do you think of FoodTruckers?
What I want to extract are the words AlphaToy, AlphaCars, and FoodTruckers. In Excel, I can get those brand names via flash fill.
As I am working with R, I need to convert the "flash fill" step into an R function, yet I couldn't find out how to do it. Here's the desired output:
brandName <- list(
Toy = c(
"1. What do you know about AlphaToy?",
"2. What do you know about BetaToyz?",
"3. What do you know about CharlieDoll?",
"4. What do you know about DeltaToys?",
"5. What do you know about Echoty?"
),
Car = c(
"18. What comes to your mind when you heard AlphaCars?",
"19. What comes to your mind when you heard BestCar?",
"20. What comes to your mind when you heard CoolCarz?"
),
Trucker = c(
"5. What do you think of FoodTruckers?",
"6. What do you think of IceCreamTruckers?",
"7. What do you think of JellyTruckers?",
"8. What do you think of SodaTruckers?"
)
)
extractBrandName <- function(...) {
#some codes here
}
#desired output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
As the title says, the function should work on dynamic strings, so when the function is applied to brandName the desired output is:
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Edit:
The brand name can be in lowercase, uppercase, or even two or more words, for instance: IBM, Louis Vuitton.
The brand names might appear in the middle of the sentence; they do not always come at the end. The thing is, the sentences are unpredictable because each client might provide different data.
Can anyone help me with the function code to achieve the desired output? Thank you in advance!
Edit: here's my attempt
The idea (thanks to shs' answer) is to find the words that all inputs have in common, then exclude them, leaving behind the unique words (which should be the brand names). Following this post, I use intersect() wrapped inside Reduce() to get the common words, then I exclude them via lapply() and make sure any brand names of two or more words are merged together with str_c(collapse = " ").
Code
library(stringr)
extractBrandName <- function(x) {
  cleanWords <- x %>%
    str_remove_all("^\\d+|\\.|,|\\?") %>%
    str_squish() %>%
    str_split(" ")
  commonWords <- cleanWords %>%
    Reduce(intersect, .)
  extractedWords <- cleanWords %>%
    lapply(., function(y) {
      y[!y %in% commonWords] %>%
        str_c(collapse = " ")
    }) %>% unlist()
  return(extractedWords)
}
Output (1st test case)
> #output
> extractBrandName(brandName$Toy)
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
> lapply(brandName, extractBrandName)
$Toy
[1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
$Car
[1] "AlphaCars" "BestCar" "CoolCarz"
$Trucker
[1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"
Output (2nd test case)
This test case includes brand names of two or more words, located in the middle and at the beginning of the sentence.
brandName2 <- list(
Middle = c("Have you used any products from AlphaToy this past 6 months?",
"Have you used any products from BetaToys Collection this past 6 months?",
"Have you used any products from Charl TOYZ this past 6 months?"),
First = c("AlphaCars is the best automobile dealer, yes/no?",
"Best Vehc is the best automobile dealer, yes/no?",
"CoolCarz & Bike is the best automobile dealer, yes/no?")
)
> #output
> lapply(brandName2, extractBrandName)
$Middle
[1] "AlphaToy" "BetaToys Collection" "Charl TOYZ"
$First
[1] "AlphaCars" "Best Vehc" "CoolCarz & Bike"
In the end, a solution to this problem was found, thanks to shs, who gave the initial idea, and to the answer in the post I linked above. If you have any suggestions, please feel free to comment. Thank you.
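For readers who prefer to avoid the pipe, the common-words idea can be sketched in base R as follows. This is only an illustration under the same assumption as the attempt above (every word shared by all questions is boilerplate); extract_brand_base and questions are hypothetical names, not part of the original post.

```r
# Base R sketch: drop the words that every question has in common.
extract_brand_base <- function(x) {
  # Remove leading numbering and punctuation, then split on whitespace.
  cleaned <- gsub("^\\d+|[.,?]", "", x)
  words <- strsplit(trimws(cleaned), "\\s+")
  # Words present in every sentence are treated as question boilerplate.
  common <- Reduce(intersect, words)
  # Keep the remaining words and re-join multi-word brand names.
  vapply(words,
         function(w) paste(w[!w %in% common], collapse = " "),
         character(1))
}

questions <- c("1. What do you know about AlphaToy?",
               "2. What do you know about BetaToyz?",
               "3. What do you know about Louis Vuitton?")
brands <- extract_brand_base(questions)
print(brands)
```

One limitation of this approach: a brand name that contains one of the common words (for example a brand with "you" in it) would lose that word, and a vector with a single question returns an empty string because every word is then "common".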
This function checks which words the first two strings have in common and then removes everything from the beginning of the strings up to and including the common element, leaving only the desired part of the string:
library(stringr)
extractBrandName <- function(x) {
  x %>%
    str_split(" ") %>%
    {.[[1]][.[[1]] %in% .[[2]]]} %>%
    str_c(collapse = " ") %>%
    str_c("^.+", .) %>%
    str_remove(x, .) %>%
    str_squish() %>%
    str_remove("\\?")
}
lapply(brandName, extractBrandName)
#> $Toy
#> [1] "AlphaToy" "BetaToyz" "CharlieDoll" "DeltaToys" "Echoty"
#>
#> $Car
#> [1] "AlphaCars" "BestCar" "CoolCarz"
#>
#> $Trucker
#> [1] "FoodTruckers" "IceCreamTruckers" "JellyTruckers" "SodaTruckers"

R: Unexpected behavior of str_extract from stringr on a string extracted from the web with rvest

I know this is an extremely weird example but it is reproducible:
I have a simple regex pattern to extract a person's height:
pattern <- "1\\.[0-9]{2} m"
Tested on a simple string it works:
library(stringr)
str_extract("1.75 m", pattern)
[1] "1.75 m"
However, it doesn't work on a string I scrape from Wikipedia, say to extract Linda Evangelista's height, using html_text from rvest:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Linda_Evangelista"
text <- read_html(url) %>%
html_nodes(".infobox") %>%
html_text()
text
[1] "Linda Evangelista\n\nEvangelista in August 2004\n\nBorn\n(1965-05-10) May 10, 1965 (age 52)St. Catharines, Ontario, Canada\nOccupation\nModel\nYears active\n1984–1998 (retired)\n2001–present\nSpouse(s)\nGérald Marie\n(m. 1987; div. 1993)\nChildren\n1\nModeling information\nHeight\n5 ft 9 in (1.75 m)[1]\nHair color\nBrown\nEye color\nBlue-green\nManager\nDNA Model Management (New York)Models 1 (London)\nView Management (Barcelona)\nPriscilla's Model Management (Sydney)\n\n"
str_extract(text, pattern)
[1] NA
Though, if you look closely, the "1.75 m" string is there.
To be sure, if I manually copy-paste the above string into a new variable, str_extract works as expected:
text_manual <- "Linda Evangelista\n\nEvangelista in August 2004\n\nBorn\n(1965-05-10) May 10, 1965 (age 52)St. Catharines, Ontario, Canada\nOccupation\nModel\nYears active\n1984–1998 (retired)\n2001–present\nSpouse(s)\nGérald Marie\n(m. 1987; div. 1993)\nChildren\n1\nModeling information\nHeight\n5 ft 9 in (1.75 m)[1]\nHair color\nBrown\nEye color\nBlue-green\nManager\nDNA Model Management (New York)Models 1 (London)\nView Management (Barcelona)\nPriscilla's Model Management (Sydney)\n\n"
str_extract(text_manual, pattern)
[1] "1.75 m"
Note both text variables are simple strings:
class(text)
[1] "character"
typeof(text)
[1] "character"
class(text_manual)
[1] "character"
typeof(text_manual)
[1] "character"
But are they identical? No:
text == text_manual
[1] FALSE
They seem to differ on the 83rd character:
str_sub(text, 1, 82) == str_sub(text_manual, 1, 82)
[1] TRUE
str_sub(text, 1, 83) == str_sub(text_manual, 1, 83)
[1] FALSE
But I have no idea why. They appear the same; that last character looks like a space in both:
str_sub(text, 1, 83)
[1] "Linda Evangelista\n\nEvangelista in August 2004\n\nBorn\n(1965-05-10) May 10, 1965 (age "
str_sub(text_manual, 1, 83)
[1] "Linda Evangelista\n\nEvangelista in August 2004\n\nBorn\n(1965-05-10) May 10, 1965 (age "
I thought about opening an issue in the stringr package on GitHub, but I'm not sure whether it's a stringr or rvest issue.
Does anyone have any idea what the issue is here?
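The two strings differ at character 83: in the scraped text it is a non-breaking space (U+00A0, code point 160), while in the manually typed text it is an ordinary space (code point 32). The strings also happen to be encoded differently: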
Encoding(text)
#> [1] "UTF-8"
Encoding(text_manual)
#> [1] "latin1"
utf8ToInt(str_sub(text, 83, 83))
#> [1] 160
utf8ToInt(str_sub(text_manual, 83, 83))
#> [1] 32
intToUtf8(utf8ToInt(str_sub(text, 83, 83)))
#> [1] " "
intToUtf8(utf8ToInt(str_sub(text_manual, 83, 83)))
#> [1] " "
(Note that your result for Encoding(text_manual) may differ depending on your locale.)
To avoid this problem, use \s in the regex, which matches any whitespace character, including the non-breaking space:
library(rvest)
library(stringr)
url <- "https://en.wikipedia.org/wiki/Linda_Evangelista"
text <- read_html(url) %>%
html_nodes(".infobox") %>%
html_text()
pattern <- "1\\.[0-9]{2}\\sm"
str_extract(text, pattern)
#> [1] "1.75 m"
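An alternative, sketched here in base R, is to normalize non-breaking spaces to ordinary spaces before matching, so that the original pattern with a literal space keeps working. This is a swapped-in technique rather than the method from the answer above, and scraped is a hypothetical snippet imitating the infobox text.

```r
# Hypothetical snippet imitating the scraped text: the space in "1.75 m"
# is a non-breaking space (U+00A0), not an ordinary one.
scraped <- "Height\n5 ft 9 in (1.75\u00a0m)[1]"
pattern <- "1\\.[0-9]{2} m"

# The literal space in the pattern does not match U+00A0.
matched_raw <- grepl(pattern, scraped)

# Replace every non-breaking space with an ordinary space, then match.
normalized <- gsub("\u00a0", " ", scraped, fixed = TRUE)
height <- regmatches(normalized, regexpr(pattern, normalized))
print(height)
```

Normalizing once right after scraping has the advantage that every later pattern, comparison, or split sees ordinary spaces.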

pattern matching with sub(), unable to catch and replace first occurrence

The following are the results I expect:
> title = "La La Land (2016/I)"
[1]"(2016" #result
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
[1]"(2013" #result
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
[1]"(2015" #result
==================================================================
The following is what I got by applying the code sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1"):
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(1500-1800) (#1.1)" #result. However, I expected it to be "(2013)"
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2016/I)" #result as I expect
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1]"(2017)" # result. However, I expect it to be "(2015)"
The following is what I got by applying the code sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1"):
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "La La Land (2016/I)" #result. However, I expect it to be "(2016)"
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2013)" #result as I expect
I checked the description of sub; it says "sub performs replacement of the first match". In this case, the first match should be (2013).
In short, I am trying to write a sub() command that returns the first occurrence of a year in a string.
I guess there is something wrong with my code, but I couldn't find it. I would appreciate it if anyone could help me.
==================================================================
In fact, my ultimate goal is to extract the year of all movies. However, I don't know how to do it in one step. Therefore, I decided to first find the year in "(dddd" format, then use the code sub(pattern="\\((\\d{4}).*", a, replacement="\\1") to get just the digits of the year.
for example:
> a= "(2015"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
> a= "(2015)"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
=================updated 05/29/2017 22:51PM=======================
the str_extract in akrun's answer works well with my dataset.
However, the sub() approach doesn't work for all data. The following is what I did; my code doesn't work with all 500 records. I would really appreciate it if anyone could point out the mistakes in my code. I really cannot figure it out myself. Thank you very much.
> t1
[1] "Man Who Fell to Earth (Remix) (2010) (TV)"
> t2
[1] "Manual pr\u0087ctico del amigo imaginario (abreviado) (2008)"
> title = c(t1,t2)
> x=gsub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
> x
[1] "(2010) (TV)" "(2008)"
> sub(pattern="\\((.*)\\).*", x, replacement="\\1")
[1] "2010) (TV" "2008"
However, my goal is to get 2010 and 2008. My code works with t2 but fails with t1.
We can match 0 or more characters that are not a ( ([^(]*) from the start (^) of the string, followed by a ( and four digits (\\([0-9]{4}) which we capture as a group ((...)) followed by other characters (.*) and replace with the backreference (\\1) of the captured group
sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
#[1] "(2016" "(2013" "(2015"
If we need to remove the (, then capture only the numbers that follows the \\( as a group
sub("^[^(]*\\(([0-9]{4}).*", "\\1", title)
#[1] "2016" "2013" "2015"
Or with str_extract, we use a regex lookaround to extract the 4 digit numbers that follows the (
library(stringr)
str_extract(title, "(?<=\\()[0-9]{4}")
#[1] "2016" "2013" "2015"
Or with regmatches/regexpr
regmatches(title, regexpr("(?<=\\()([0-9]{4})", title, perl = TRUE))
#[1] "2016" "2013" "2015"
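Why the original attempts returned the last year rather than the first: the leading .* is greedy, so it consumes as much of the string as it can while still letting the rest of the pattern match. The sketch below contrasts greedy and lazy matching and also checks the "(Remix)" title from the update, where the first parenthesis is not a year. The accent-free spelling of the second title is a simplification for illustration.

```r
title <- "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"

# Greedy: the leading .* swallows everything up to the LAST "(dddd)".
greedy <- sub(".*(\\(\\d{4}\\)).*", "\\1", title)

# Lazy: .*? consumes as little as possible, so the FIRST "(dddd)" is captured.
lazy <- sub(".*?(\\(\\d{4}\\)).*", "\\1", title, perl = TRUE)

# Titles whose first parenthesis does not contain a year.
titles <- c("Man Who Fell to Earth (Remix) (2010) (TV)",
            "Manual practico del amigo imaginario (abreviado) (2008)")

# Anchoring on the first "(" finds no match here, so sub() returns its input.
unchanged <- sub("^[^(]*\\(([0-9]{4}).*", "\\1", titles[1])

# The lookbehind version only requires "(" directly before four digits.
years <- regmatches(titles, regexpr("(?<=\\()[0-9]{4}", titles, perl = TRUE))
print(years)
```

Keep in mind that regmatches() with regexpr() silently drops elements that have no match, so the result can be shorter than the input when some titles contain no year.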
data
title <- c("La La Land (2016/I)",
"_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_",
"dfajfj(2015)asdfjuwer f(2017)fa.erewr6")

Extracting hashtags in several tweets using R

I desperately want a solution to extracting hashtags from collective tweets in R.
For example:
[[1]]
[1] "RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle"
[[2]]
[1] "BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012"
[[3]]
[1] "BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech"
How can I parse it to extract the list of hashtag words in all the tweets?
Previous solutions display only hashtags in the first tweet with these error messages in the code:
> string <-"MonicaSarkar: RT #saultracey: Sun kissed #olmpicrings at #towerbridge #london2012 # Tower Bridge http://t.co/wgIutHUl"
>
> [[2]]
Error: unexpected '[[' in "[["
> [1] "ccrews467: RT #BBCNews: England manager Roy Hodgson calls #London2012 a \"wake-up call\": footballers and fans should emulate spirit of #Olympics http://t.co/wLD2VA1K"
Error: unexpected '[' in "["
> hashtag.regex <- perl("(?<=^|\\s)#\\S+")
> hashtags <- str_extract_all(string, hashtag.regex)
> print(hashtags)
[[1]]
[1] "#olmpicrings" "#towerbridge" "#london2012"
Using regmatches and gregexpr, this gives you a list with the hashtags per tweet, assuming a hashtag has the format # followed by any number of letters or digits (I am not that familiar with Twitter):
foo <- c("RddzAlejandra: RT @NiallOfficial: What a day for @johnJoeNevin ! Sooo proud t have been there to see him at #London2012 and here in mgar #MullingarShuffle","BPOInsight: RT @atos: Atos completes delivery of key IT systems for London 2012 Olympic Games http://t.co/Modkyo2R #london2012","BloombergWest: The #Olympics sets a ratings record for #NBC, with 219M viewers tuning in. http://t.co/scGzIXBp #london2012 #tech")
regmatches(foo,gregexpr("#(\\d|\\w)+",foo))
Returns:
[[1]]
[1] "#London2012" "#MullingarShuffle"
[[2]]
[1] "#london2012"
[[3]]
[1] "#Olympics" "#NBC" "#london2012" "#tech"
How about a strsplit and grep version:
> lapply(strsplit(x, ' '), function(w) grep('#', w, value=TRUE))
[[1]]
[1] "#London2012" "#MullingarShuffle"
[[2]]
[1] "#london2012"
[[3]]
[1] "#Olympics" "#NBC," "#london2012" "#tech"
I couldn't figure out how to return multiple results from each string without first splitting, but I bet there is a way!
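Returning several matches per string without splitting first is what gregexpr() does: it finds every match in each element, and regmatches() extracts them as a list. A minimal sketch, using shortened hypothetical tweets and assuming a hashtag is # followed by letters, digits, or underscores:

```r
# Shortened, hypothetical tweets for illustration.
tweets <- c(
  "The #Olympics sets a ratings record for #NBC, with 219M viewers. #london2012 #tech",
  "Atos completes delivery of key IT systems #london2012",
  "No tags in this one"
)

# gregexpr() locates all matches per element; regmatches() pulls them out.
# The character class stops at punctuation, so "#NBC," yields "#NBC".
hashtags <- regmatches(tweets, gregexpr("#[[:alnum:]_]+", tweets))
print(hashtags)

# Flatten to one vector when a per-tweet list is not needed.
all_tags <- unlist(hashtags)
```

Unlike the strsplit and grep version, this does not keep trailing punctuation, and tweets without hashtags yield an empty character vector rather than being dropped.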
