How to keep non-alphanumeric symbols when tokenizing words in R? - r

I am using the tokenizers package in R for tokenizing a text, but non-alphanumeric symbols such as "#" or "&" are lost and I need to keep them. Here is the function I am using:
tokenize_ngrams("My number & email address user#website.com", lowercase = FALSE, n = 3, n_min = 1, stopwords = character(), ngram_delim = " ", simplify = FALSE)
I know tokenize_character_shingles has the strip_non_alphanum argument that allows keeping the punctuation, but the tokenization is applied to characters, not words.
Does anyone know how to handle this issue?

If you are okay with using a different package, ngram has two useful functions that retain those non-alphanumeric symbols:
> library(ngram)
> print(ngram("My number & email address user#website.com",n = 2), output = 'full')
number & | 1
email {1} |
My number | 1
& {1} |
address user#website.com | 1
NULL {1} |
& email | 1
address {1} |
email address | 1
user#website.com {1} |
> print(ngram_asweka("My number & email address user#website.com",1,3), output = 'full')
[1] "My number &" "number & email"
[3] "& email address" "email address user#website.com"
[5] "My number" "number &"
[7] "& email" "email address"
[9] "address user#website.com" "My"
[11] "number" "&"
[13] "email" "address"
[15] "user#website.com"
>
Another excellent package, quanteda, gives more flexibility with its remove_punct parameter.
> library(quanteda)
> text <- "My number & email address user#website.com"
> tokenize(text, ngrams = 1:3)
tokenizedTexts from 1 document.
Component 1 :
[1] "My" "number"
[3] "&" "email"
[5] "address" "user#website.com"
[7] "My_number" "number_&"
[9] "&_email" "email_address"
[11] "address_user#website.com" "My_number_&"
[13] "number_&_email" "&_email_address"
[15] "email_address_user#website.com"
>
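Note that tokenize() has since been removed from quanteda; a rough equivalent with the current tokens() API (argument names may vary across versions, so treat this as a sketch) is:

```r
library(quanteda)

text <- "My number & email address user#website.com"

# remove_punct = FALSE (the default) keeps symbols such as "&" and "#"
toks <- tokens(text, remove_punct = FALSE)

# build unigrams through trigrams, joined with "_"
tokens_ngrams(toks, n = 1:3, concatenator = "_")
```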

Related

How to write a regex pattern to extract the location from ATM withdrawal transactions in a bank statement

I wish to write a regex pattern to extract the address or location from a narration string, for a dataset of 350k records.
txn_add <- data.frame(NARRATION=c("$ $ $ +YBL PATAUDI CHOWK \ $",
"$ $ -ATM CASH 83181 + MAIN BHAWANA ROAD NEW DELHI $",
"$ $ [5839/P1TNDE06/+RAGHUBARPURA $",
"$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $",
"$ ATM CASH-N4077800-+SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION $"))
I ran the following regex pattern:
gsub(".*[:|+]([^.]+)[$|\\|\\/].*", "\\1", txn_add$NARRATION)
And I got the following output:
[1] "YBL PATAUDI CHOWK "
[2] " MAIN BHAWANA ROAD NEW DELHI "
[3] "RAGHUBARPURA "
[4] "$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $"
[5] "SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION "
This output is not correct because I have to implement some conditions:
The address can start with:
1. '+'
2. '#'
3. ' AT '
4. ':'
5. <P|S>SBI<P|S> # the exact text SBI preceded and followed by punctuation or space
6. <NNN> followed by <P|S|A> # 3 digits followed by punctuation, space, or a letter
And end with:
1. '-'
2. '/'
3. '$'
4. '\'
5. <NNNNNNN> # a run of digits
It can contain:
letters, digits, dot (.), dash (-), space ( ), comma (,), underscore (_), brackets (()), at (@), hash (#), ampersand (&) and semicolon (;).
This is to extract the address from the transaction. The desired output is:
[1] "YBL PATAUDI CHOWK"
[2] "MAIN BHAWANA ROAD NEW DELHI "
[3] "RAGHUBARPURA "
[4] "DELHIIN"
[5] "SPRINGFIELDCOLONYFFAR IDABADHRIN"
I am not able to get the desired output. What can I try next?
You might use a capture group
(?:[+#:]|\bAT(?!M))\s*([A-Z]+(?:\s+[A-Z]+)*)
Explanation
(?: Non capture group
[+#:] Match one of + # :
| Or
\bAT(?!M) Match AT not followed by M
) Close group
\s* Match 0+ whitespace chars
( Capture group 1
[A-Z]+(?:\s+[A-Z]+)* Match chars A-Z with 1+ whitespace chars in between
) Close group 1
With sub matching all before and after the group:
sub(".*(?:[+#:]|\\bAT(?!M))\\s*([A-Z]+(?:\\s+[A-Z]+)*).*", "\\1", txn_add$NARRATION, perl=TRUE)
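Putting it together with the sample data (the trimws() call is my addition, to drop any stray surrounding spaces):

```r
txn_add <- data.frame(
  NARRATION = c(
    "$ $ $ +YBL PATAUDI CHOWK \\ $",
    "$ $ -ATM CASH 83181 + MAIN BHAWANA ROAD NEW DELHI $",
    "$ $ [5839/P1TNDE06/+RAGHUBARPURA $",
    "$ MAXIMUMOUTFITS PRIVATE LIMITED } $ ATDELHIIN- $ $ /5631 $",
    "$ ATM CASH-N4077800-+SPRINGFIELDCOLONYFFAR IDABADHRIN-04/06/18 $ /5631 ( $ $ VERIFICATION $"
  ),
  stringsAsFactors = FALSE
)

# Keep only capture group 1; trimws() strips leading/trailing whitespace
addr <- trimws(sub(".*(?:[+#:]|\\bAT(?!M))\\s*([A-Z]+(?:\\s+[A-Z]+)*).*",
                   "\\1", txn_add$NARRATION, perl = TRUE))
addr
# [1] "YBL PATAUDI CHOWK"                "MAIN BHAWANA ROAD NEW DELHI"
# [3] "RAGHUBARPURA"                     "DELHIIN"
# [5] "SPRINGFIELDCOLONYFFAR IDABADHRIN"
```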

add text to atomic (character) vector in r

Good afternoon. I am not an expert on atomic vectors, but I would like some ideas about the following.
I have the script for the movie "Coco", in which scene headers are numbered in the form 1., 2., ... (130 scenes throughout the movie). I want to convert each scene-number line into "Scene 1", "Scene 2", up to "Scene 130", sequentially.
url <- "https://www.imsdb.com/scripts/Coco.html"
library(readr)
coco <- read_lines("coco2.txt") # after cleaning
class(coco)
typeof(coco)
" 48."
[782] " arms full of offerings."
[783] " Once the family clears, Miguel is nowhere to be seen."
[784] " INT. NEARBY CORRIDOR"
[785] " Miguel and Dante hide from the patrolman. But Dante wanders"
[786] " off to inspect a side room."
[787] " INT. DEPARTMENT OF CORRECTIONS"
[788] " Miguel catches up to Dante. He overhears an exchange in a"
[789] " nearby cubicle."
[797] " 49."
[798] " And amigos, they help their amigos."
[799] " worth your while."
[800] " workstation."
[801] " Miguel perks at the mention of de la Cruz."
[809] " Miguel follows him."
[810] " 50." # It's a scene number
[811] " INT. HALLWAY"
s <- grep(coco, pattern = "[^Level].[0-9].$", value = TRUE)
My solution is wrong because the replacement is not sequential:
v <- gsub(s, pattern = "[^Level].[0-9].$", replacement = paste("Scene", sequence(1:130)))
[1] " Scene1"
[2] " Scene1"
[3] " Scene1"
[4] " Scene1"
[5] " Scene1"
[6] " Scene1"
I'm not sure what you intend [^Level] to match; note that it is a negated character class, matching any single character other than L, e, v, or l. However, if the numbers at the end of lines in the text represent the scene numbers, then you can use ( ) to capture the numbers and substitute them into your replacement text as shown below:
v <- gsub(s, pattern = " ([0-9]{1,3})\\.$", replacement = "Scene \\1")
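For example, on a few of the lines above (a sketch; the exact amount of leading whitespace in your file may differ, hence the looser ^\s* anchor):

```r
coco <- c("                                                      48.",
          "           arms full of offerings.",
          "                                                      49.",
          "                                                      50.")

# Keep only the scene-number lines, then swap the captured number into "Scene N"
s <- grep("^\\s*[0-9]{1,3}\\.$", coco, value = TRUE)
v <- gsub("^\\s*([0-9]{1,3})\\.$", "Scene \\1", s)
v
# [1] "Scene 48" "Scene 49" "Scene 50"
```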

How to allow a space into a wildcard?

Let's say I have this sentence :
text<-("I want to find both the greatest cake of the world but also some very great cakes but I want to find this last part : isn't it")
When I write this (kwic is a quanteda function):
kwic(text,phrase("great* cake*"))
I get
[text1, 7:8] want to find both the | greatest cake | of the world but also
[text1, 16:17] world but also some very | great cakes | but I want to find
However, when I do
kwic(text,phrase("great*cake*"))
I get a kwic object with 0 rows, i.e. nothing.
I would like to know what exactly the * replaces and, more importantly, how to "allow" a space to be taken into account in the wildcard.
To answer what the * matches, you need to understand the "glob" valuetype, which you can read about using ?valuetype. In short, * matches any number of any characters, including none. Note that this is very different from its use in a regular expression, where it means "match zero or more of the preceding character".
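Base R's glob2rx() makes the difference concrete by translating a glob pattern into the equivalent regular expression (a quick illustration, independent of quanteda):

```r
# glob2rx() converts "glob" wildcards into regular expressions
glob2rx("great*")
# [1] "^great"   (a trailing * becomes an unanchored match)

# "greatest cake" matches the glob "great* cake*" once translated
grepl(glob2rx("great* cake*"), "greatest cake")
# [1] TRUE
```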
The pattern argument in kwic() matches one pattern per token, after tokenizing the text. Even wrapped in the phrase() function, it still only considers sequences of matches to tokens. So you cannot match the whitespace (which defines the boundaries between tokens) unless you actually include these inside the token's value itself.
How could you do that? Like this:
toksbi <- tokens(text, ngrams = 2, concatenator = " ")
# tokens from 1 document.
# text1 :
# [1] "I want" "want to" "to find" "find both" "both the"
# [6] "the greatest" "greatest cake" "cake of" "of the" "the world"
# [11] "world but" "but also" "also some" "some very" "very great"
# [16] "great cakes" "cakes but" "but I" "I want" "want to"
# [21] "to find" "find this" "this last" "last part" "part :"
# [26] ": isn't" "isn't it"
kwic(toksbi, "great*cake*", window = 2)
# [text1, 7] both the the greatest | greatest cake | cake of of the
# [text1, 16] some very very great | great cakes | cakes but but I
But your original usage of kwic(text, phrase("great* cake*")) is the recommended approach.

Using R package ggmap with Google Directions API token

I'm using ggmap's "route" function to plot driving directions using Google's Driving Directions API. I may need to exceed the 2,500 daily API limit. I know that to exceed that threshold, I'll have to pay the $0.50/1000 hits fee and sign up for a Google Developers API token. However, I don't see any parameters in the ggmap library or route function that allow me to enter my token and key information so that I can exceed the threshold. What am I missing?
I've written the package googleway to access google maps API where you can specify your token key
For example
library(googleway)
key <- "your_api_key"
google_directions(origin = "MCG, Melbourne",
destination = "Flinders Street Station, Melbourne",
key = key,
simplify = F) ## use simplify = T to return a data.frame
[1] "{"
[2] " \"geocoded_waypoints\" : ["
[3] " {"
[4] " \"geocoder_status\" : \"OK\","
[5] " \"partial_match\" : true,"
[6] " \"place_id\" : \"ChIJIdtrbupC1moRMPT0CXZWBB0\","
[7] " \"types\" : ["
[8] " \"establishment\","
[9] " \"point_of_interest\","
[10] " \"train_station\","
[11] " \"transit_station\""
[12] " ]"
[13] " },"
[14] " {"
[15] " \"geocoder_status\" : \"OK\","
[16] " \"place_id\" : \"ChIJSSKDr7ZC1moRTsSnSV5BnuM\","
[17] " \"types\" : ["
[18] " \"establishment\","
[19] " \"point_of_interest\","
[20] " \"train_station\","
[21] " \"transit_station\""
[22] " ]"
[23] " }"
[24] " ],"
[25] " \"routes\" : ["
... etc

How to print double quotes (") in R

I want to print double quotes (") to the screen in R, but it is not working. The usual escape sequences are not working:
> print('"')
[1] "\""
> print('\"')
[1] "\""
> print('/"')
[1] "/\""
> print('`"')
[1] "`\""
> print('"xml"')
[1] "\"xml\""
> print('\"xml\"')
[1] "\"xml\""
> print('\\"xml\\"')
[1] "\\\"xml\\\""
I want it to return:
" "xml" "
which I will then use downstream.
Any ideas?
Use cat:
cat("\" \"xml\" \"")
OR
cat('" "','xml','" "')
Output:
" "xml" "
An alternative using noquote:
noquote(" \" \"xml\" \" ")
Output :
" "xml" "
Another option using dQuote:
dQuote(" xml ")
Output :
"“ xml ”"
With the help of print's quote parameter:
print("\" \"xml\" \"", quote = FALSE)
[1] " "xml" "
or
cat('"')
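writeLines() is another base R option that prints the string exactly as stored, with no surrounding quotes and no escapes:

```r
x <- "\" \"xml\" \""

# writeLines() emits the string verbatim, unlike print()
writeLines(x)
# " "xml" "
```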
