Extract youtube video ID from url with R stringr regex - r

I'm looking to extract only the video id string from a column of youtube links.
The stringr function I'm currently using is this:
str_extract(data$link, "\\b[^=]+$")
This works for most standard youtube links with the id at the end of the url appearing after an = sign i.e.
youtube.com/watch?v=kFF0v0FQzEI
However not all links follow this pattern, examples:
youtube.com/v/kFF0v0FQzEI
youtube.com/vi/kFF0v0FQzEI
youtu.be/kFF0v0FQzEI
www.youtube.com/v/kFF0v0FQzEI?feature=autoshare&version=3&autohide=1&autoplay=1
www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1
So could anyone help me out with an R regex pattern to extract the id (kFF0v0FQzEI in this case) in all the examples above?
I've seen examples of regex patterns used in other languages to do this but I'm unsure how to convert to R compliance.
Thanks!

You could use something like the following, but note that it's pretty heavily hard-coded to the examples you provided.
links = c("youtube.com/v/kFF0v0FQzEI",
"youtube.com/vi/kFF0v0FQzEI",
"youtu.be/kFF0v0FQzEI",
"www.youtube.com/v/kFF0v0FQzEI?feature=autoshare&version=3&autohide=1&autoplay=1",
"www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1",
"youtube.com/watch?v=kFF0v0FQzEI",
"http://www.youtube.com/watch?argv=xyz&v=kFF0v0FQzEI")
get_id = function(link) {
if (stringr::str_detect(link, '/watch\\?')) {
rgx = '(?<=\\?v=|&v=)[\\w]+'
} else {
rgx = '(?<=/)[\\w]+/?(?:$|\\?)'
}
stringr::str_extract(link, rgx)
}
ids = unname(sapply(links, get_id))
# [1] "kFF0v0FQzEI" "kFF0v0FQzEI" "kFF0v0FQzEI" "kFF0v0FQzEI?"
# "kFF0v0FQzEI" "kFF0v0FQzEI" "kFF0v0FQzEI"

Related

Speech_contexts phrase list not working in google speech.SpeechAsyncClient.streaming_recognize

Unable to make speech_contexts phrase lists work with speech.SpeechAsyncClient in Google Speech to Text..
The transcription works, but the phrase list appears to be ignored.
Is there any config that needs to be in-place?
When Using the speech.SpeechAsyncClient (version 2.17.2 in python) I created a phrase list :
speech_contexts {
phrases: "Burrito"
boost: 10.0
}
speech_contexts {
phrases: "burrito"
boost: 5.0
}
I expected the word audio for 'burrito' to be transcribed as 'Burrito' as text. However it continued to be 'burrito'. Also I tried various phrase lists, but the recognition seems to ignore the phrase lists (same result with/without phrase list).
I verified that the proper speech_context is being sent in the 'streaming_config/Recogntionconfig like this:
Recognitionconfig = speech.RecognitionConfig(
encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
#encoding = cloud_speech.ExplicitDecodingConfig.AudioEncoding.LINEAR16,
sample_rate_hertz=16000,
language_code="en-US",
model="latest_long",
#enable_word_confidence=True,
speech_contexts=speech_contexts #this contains the phrase list
)
#The first message is the following streaming_config and is then followed by audio
streaming_config = speech.StreamingRecognitionConfig(
config=Recognitionconfig, interim_results=True
)
Try using model adaptation to strengthen the accuracy of your transcription results. It also uses RecognitionConfig for the request body. Also follow this format when using SpeechContext.
{
"phrases": [
string
],
"boost": number
}

In R: Search all emails by subject line, pull comma-separate values from body, then save values in a dataframe

Each day, I get an email with the quantities of fruit sold on a particular day. The structure of the email is as below:
Date of report:,04-JAN-2022
Time report produced:,5-JAN-2022 02:04
Apples,6
Pears,1
Lemons,4
Oranges,2
Grapes,7
Grapefruit,2
I'm trying to build some code in R that will search through my emails, find all emails with a particular subject, iterate through each email to find the variables I'm looking for, take the values and place them in a dataframe with the "Date of report" put in a date column.
With the assistance of people in the community, I was able to achieve the desired result in Python. However as my project has developed, I need to now achieve the same result in R if at all possible.
Unfortunately, I'm quite new to R and therefore if anyone has any advice on how to take this forward I would greatly appreciate it.
For those interested, my Python code is below:
#PREP THE STUFF
Fruit_1 = "Apples"
Fruit_2 = "Pears"
searchf = [
Fruit_1,
Fruit_2
]
#DEF THE STUFF
def get_report_vals(report, searches):
dct = {}
for line in report:
term, *value = line
if term.casefold().startswith('date'):
dct['date'] = pd.to_datetime(value[0])
elif term in searches:
dct[term] = float(value[0])
if len(dct.keys()) != len(searches):
dct.update({x: None for x in searches if x not in dct})
return dct
#DO THE STUFF
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
messages.Sort("[ReceivedTime]", True)
results = []
for message in messages:
if message.subject == 'FRUIT QUANTITIES':
if Fruit_1 in message.body and Fruit_2 in message.body:
data = [line.strip().split(",") for line in message.body.split('\n')]
results.append(get_report_vals(data, searchf))
else:
pass
fruit_vals = pd.DataFrame(results)
fruit_vals.columns = map(str.upper, fruit_vals.columns)
I'm probably going about this the wrong way, but I'm trying to use the steps I took in Python to achieve the same result in R. So for example I create some variables to hold the fruit sales I'm searching for, then I create a vector to store the searchables, and then when I create an equivalent 'get_vals' function, I create an empty vector.
library(RDCOMClient)
Fruit_1 <- "Apples"
Fruit_2 <- "Pears"
##Create vector to store searchables
searchf <- c(Fruit_1, Fruit_2)
## create object for outlook
OutApp <- COMCreate("Outlook.Application")
outlookNameSpace = OutApp$GetNameSpace("MAPI")
search <- OutApp$AdvancedSearch("Inbox", "urn:schemas:httpmail:subject = 'FRUIT QUANTITIES'")
inbox <- outlookNameSpace$Folders(6)$Folders("Inbox")
vec <- c()
for (x in emails)
{
subject <- emails(i)$Subject(1)
if (grepl(search, subject)[1])
{
text <- emails(i)$Body()
print(text)
break
}
}
read.table could be a good start for get_report_vals.
Code below outputs result as a list, exception handling still needs to be implemented :
report <- "
Date of report:,04-JAN-2022
Apples,6
Pears,1
Lemons,4
Oranges,2
Grapes,7
Grapefruit,2
"
get_report_vals <- function(report,searches) {
data <- read.table(text=report,sep=",")
colnames(data) <- c('key','value')
# find date
date <- data[grepl("date",data$key,ignore.case=T),"value"]
# transform dataframe to list
lst <- split(data$value,data$key)
# output result as list
c(list(date=date),lst[searches])
}
get_report_vals(report,c('Lemons','Oranges'))
$date
[1] "04-JAN-2022"
$Lemons
[1] "4"
$Oranges
[1] "2"
The results of various reports can then be concatenated in a data.frame using rbind:
rbind(get_report_vals(report,c('Lemons','Oranges')),get_report_vals(report,c('Lemons','Oranges')))
date Lemons Oranges
[1,] "04-JAN-2022" "4" "2"
[2,] "04-JAN-2022" "4" "2"
The code now functions as intended. Function was written quite a bit differently from those recommended:
get_vals <- function(email) {
body <- email$body()
date <- str_extract(body, "\\d{2}-[:alpha:]{3}-\\d{4}") %>%
as.character()
data <- read.table(text = body, sep = ",", skip = 9, strip.white = T) %>%
row_to_names(1) %>%
mutate("Date" = date)
return(data)
}
In addition I've written this to bind the rows together:
info <- sapply(results, get_vals, simplify = F) %>%
bind_rows()
May this is not what you are expecting to get as an answer, but I must state that here to help other readers to avoid such mistakes in future.
Unfortunately your Python code is not well-written. For example, I've noticed the following code where you iterate over all items in a folder and check the Subject and message bodies for keywords:
for message in messages:
if message.subject == 'FRUIT QUANTITIES':
if Fruit_1 in message.body and Fruit_2 in message.body:
You need to use the Find/FindNext or Restrict methods of the Items class instead. So, you don't need to iterate over all items in a folder. Instead, you get only items that correspond to your conditions. Read more about these methods in the following articles:
How To: Use Find and FindNext methods to retrieve Outlook mail items from a folder (C#, VB.NET)
How To: Use Restrict method to retrieve Outlook mail items from a folder
You may combine all your search criteria into a single query. So, you just need to iterate over found items and extract the data.
Also you may find the AdvancedSearch method helpful. The key benefits of using the AdvancedSearch method in Outlook are:
The search is performed in another thread. You don’t need to run another thread manually since the AdvancedSearch method runs it automatically in the background.
Possibility to search for any item types: mail, appointment, calendar, notes etc. in any location, i.e. beyond the scope of a certain folder. The Restrict and Find/FindNext methods can be applied to a particular Items collection (see the Items property of the Folder class in Outlook).
Full support for DASL queries (custom properties can be used for searching too). You can read more about this in the Filtering article in MSDN. To improve the search performance, Instant Search keywords can be used if Instant Search is enabled for the store (see the IsInstantSearchEnabled property of the Store class).
You can stop the search process at any moment using the Stop method of the Search class.
See Advanced search in Outlook programmatically: C#, VB.NET for more information.

How to Change Part of URL With a Function Input in R?

Let's say we have a url in R like:
url <- 'http://google.com/maps'
And the objective is to change the 'maps' part of it. I'd like to write a function where basically I can just input something (e.g. 'maps', 'images'), etc., and the relevant part of the url will automatically change to reflect what I'm typing in.
Is there a way to do this in R, where part of the url can be changed by typing something into a function?
Thanks!
You have to store the part you type into a variable and paste this to the base URL:
base_url <- "http://google.com/"
your_extension <- "maps"
paste0(base_url, your_extension)
[1] "http://google.com/maps"
If you have to start with a fixed URL, use sub to replace the last part:
sub("\\w+$", 'foo', url)
# "http://google.com/foo"
You can use dirname to remove the last part of the URL and paste it with additional custom string.
change_url_part <- function(base_url, string) {
paste(dirname(base_url), string, sep = '/')
}
change_url_part('http://google.com/maps', 'images')
#[1] "http://google.com/images"

Can R read html-encoded emoji characters?

Question
My question, explained below, is:
How can R be used to read a string that includes HTML emoji codes like πŸ€—?
I'd like to:
(1) represent the emoji symbol (e.g., as a unicode symbol: πŸ€—) in the parsed string, OR(2) convert it into its text equivalent (":hugging face:")
Background
I have an XML dataset of text messages (from the Android/iOS app Signal) that I am reading into R for a text mining project. The data look like this, with each text message represented in an sms node:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- File Created By Signal -->
<smses count="1">
<sms protocol="0" address="+15555555555" contact_name="Jane Doe" date="1483256850399" readable_date="Sat, 31 Dec 2016 23:47:30 PST" type="1" subject="null" body="Hug emoji: πŸ€—" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" />
</smses>
Problem
I am currently reading the data using the xml2 package for R. When I use the xml2::read_xml function, however, I get the following error message:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
xmlParseCharRef: invalid xmlChar value 55358
Which, as I understand, indicates that the emoji character is not recognized as valid XML.
Using the xml2::read_html function does work, but drops the emoji character. A small example of this is here:
example_text <- "Hugging emoji: πŸ€—"
xml2::xml_text(xml2::read_html(paste0("<x>", example_text, "</x>")))
(Output: [1] "Hugging emoji: ")
This character is valid HTML -- Googling πŸ€— actually converts it in the search bar to the "hugging face" emoji, and brings up results relating to that emoji.
Other information I've found that seems relevant to this question
I've been searching Stack Overflow, and have not found any questions relating to this particular issue. I've also not been able to find a table that straightforwardly gives HTML codes next to the emoji they represent, and so am not able to do an (albeit inefficient) conversion of these HTML codes to their textual equivalents in a big loop before parsing the dataset; for example, neither this list nor its underlying dataset seem to include the string 55358.
tl;dr: the emoji aren't valid HTML entities; UTF-16 numbers have been used to build them instead of Unicode code points. I describe an algorithm at the bottom of the answer to convert them so that they are valid XML.
Identifying the Problem
R definitely handles emoji:
In fact, a few packages exist for handling emoji in R. For example, the emojifont and emo packages both let you retrieve emoji based on Slack-style keywords. It's just a question of getting your source characters through from the HTML-escaped format so that you can convert them.
xml2::read_xml seems to do fine with other HTML entities, like an ampersand or double quotes. I looked at this SO answer to see whether there were any XML-specific constraints on HTML entities, and it seemed like they were storing emoji fine. So I tried changing the emoji codes in your reprex to the ones in that answer:
body="Hug emoji: πŸ˜€πŸ˜ƒ"
And, sure enough, they were preserved (though they're obviously not the hug emoji anymore):
> test8 = read_html('Desktop/test.xml')
> test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
[1] "Hug emoji: \U0001f600\U0001f603"
I looked up the hug emoji on this page, and the decimal HTML entity given there is not πŸ€—. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ;.
In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you can't control the source, you might need to do some pre-processing to account for these errors.
So, why does the browser convert them properly? I'm wondering if the browser is a little more flexible with these things and is making some guesses about what those codes could be. I'm just speculating, though.
Converting UTF-16 to Unicode code points
After some more investigation, it looks like valid emoji HTML entities use the Unicode code point (in decimal, if it's &#...;, or hex, if it's &#x...;). The Unicode code point is different from the UTF-8 or UTF-16 code. (That link explains a lot about how emoji and other characters are variously encoded, BTW! Good read.)
So we need to convert the UTF-16 codes used in your source data to Unicode code points. Referring to this Wikipedia article on UTF-16, I've verified how it's done. Each Unicode code point (our target) is a 20-bit number, or five hex digits. When going from Unicode to UTF-16, you split it up into two 10-bit numbers (the middle hex digit gets cut in half, with two of its bits going to each block), do some maths on them and get your result).
Going backwards, as you want to, it's done like this:
Your decimal UTF-16 number (which is in two separate blocks for now) is 55358 56599
Converting those blocks to hex (separately) gives 0x0d83e 0x0dd17
You subtract 0xd800 from the first block and 0xdc00 from the second to give 0x3e 0x117
Converting them to binary, padding them out to 10 bits and concatenating them, it's 0b0000 1111 1001 0001 0111
Then we convert that back to hex, which is 0x0f917
Finally, we add 0x10000, giving 0x1f917
Therefore, our (hex) HTML entity is πŸ€—. Or, in decimal, &#129303
So, to preprocess this dataset, you'll need to extract the existing numbers, use the algorithm above, then put the result back in (with one &#...;, not two).
Displaying emoji in R
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
JavaScript Solution
I had this exact same problem, but needed the solution in JavaScript, not R. Using rensa's comment above (hugely helpful!), I created the following code to solve this issue, and I just wanted to share it in case anyone else happens across this thread as I did, but needed it in JavaScript.
str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
And, here's a full snippet of it working if you'd like to run it:
var str = 'πŸ˜ŠπŸ˜˜πŸ˜€πŸ˜†πŸ˜‚πŸ˜'
str = str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
document.getElementById('result').innerHTML = str;
// πŸ˜ŠπŸ˜˜πŸ˜€πŸ˜†πŸ˜‚πŸ˜
// is turned into
// πŸ˜ŠπŸ˜˜πŸ˜€πŸ˜†πŸ˜‚πŸ˜
// which is rendered by the browser as the emojis
Original:<br>πŸ˜ŠπŸ˜˜πŸ˜€πŸ˜†πŸ˜‚πŸ˜<br><br>
Result:<br>
<div id='result'></div>
My SMS XML Parser application is working great now, but it stalls out on large XML files so, I'm thinking about rewriting it in PHP. If/when I do, I'll post that code as well.
I've implemented the algorithm described by rensa above in R, and am sharing it here. I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
This is a quick and unpolished implementation of rensa's algorithm, but it works!
utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){
string_elements <- str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]
string3a <- string_elements[1]
string3b <- string_elements[2]
string4a <- sprintf("0x0%x", as.numeric(string3a))
string4b <- sprintf("0x0%x", as.numeric(string3b))
string5a <- paste0(
# "0x",
as.hexmode(string4a) - 0xd800
)
string5b <- paste0(
# "0x",
as.hexmode(string4b) - 0xdc00
)
string6 <- paste0(
stringi::stri_pad(
paste0(BMS::hex2bin(string5a), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = ""),
stringi::stri_pad(
paste0(BMS::hex2bin(string5b), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = "")
)
string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))
string8 <- as.hexmode(string7) + 0x10000
unicode_pattern <- string8
unicode_pattern
}
make_unicode_entity <- function(x) {
paste0("\\U000", utf16_double_dec_code_to_utf8(x))
}
make_html_entity <- function(x) {
paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
}
# An example string, using the "hug" emoji:
example_string <- "test πŸ€— test"
output_string <- stringr::str_replace_all(
example_string,
"(&#[0-9]*?;){2}", # Find all two-character "&#...;&#...;" codes.
make_unicode_entity
# make_html_entity
)
cat(output_string)
# To print Unicode string (doesn't display in R console, but can be copied and
# pasted elsewhere:
# (This assumes you've used 'make_unicode_entity' above in the str_replace_all
# call):
stringi::stri_unescape_unicode(output_string)
Translated Chad's JavaScript answer to Go since I too had the same issue, but needed a solution in Go.
https://play.golang.org/p/h9JBFzqcd90
package main
import (
"fmt"
"html"
"regexp"
"strconv"
"strings"
)
func main() {
emoji := "πŸ˜ŠπŸ˜˜πŸ˜€πŸ˜†πŸ˜‚πŸ˜"
regexp := regexp.MustCompile(`(&#\d+;){2}`)
matches := regexp.FindAllString(emoji, -1)
var builder strings.Builder
for _, match := range matches {
s := strings.Replace(match, "&#", "", -1)
parts := strings.Split(s, ";")
a := parts[0]
b := parts[1]
c, err := strconv.Atoi(a)
if err != nil {
panic(err)
}
d, err := strconv.Atoi(b)
if err != nil {
panic(err)
}
c = c - 0xd800
d = d - 0xdc00
e := strconv.FormatInt(int64(c), 2)
f := strconv.FormatInt(int64(d), 2)
g := "0000000000"[2:len(e)] + e
h := "0000000000"[10:len(f)] + f
j, err := strconv.ParseInt(g + h, 2, 64)
if err != nil {
panic(err)
}
k := j + 0x10000
_, err = builder.WriteString("&#x" + strconv.FormatInt(k, 16) + ";")
if err != nil {
panic(err)
}
}
fmt.Println(html.UnescapeString(emoji))
emoji = html.UnescapeString(builder.String())
fmt.Println(emoji)
}

Removing punctuations from text using R

I need to remove punctuation from the text. I am using tm package but the catch is :
eg: the text is something like this:
data <- "I am a, new comer","to r,"please help","me:out","here"
now when I run
library(tm)
data<-removePunctuation(data)
in my code, the result is :
I am a new comerto rplease helpmeouthere
but what I expect is:
I am a new comer to r please help me out here
Here's how I take your question, and an answer that is very close to #David Arenburg's in the comment above.
data <- '"I am a, new comer","to r,"please help","me:out","here"'
gsub('[[:punct:] ]+',' ',data)
[1] " I am a new comer to r please help me out here "
The extra space after [:punct:] is to add spaces to the string and the + matches one or more sequential items in the regular expression. This has the side effect, desirable in some cases, of shortening any sequence of spaces to a single space.
If you had something like
string <- "hello,you"
> string
[1] "hello,you"
You could do this:
> gsub(",", "", string)
[1] "helloyou"
It replaces the "," with "" in the variable called string

Resources