Extracting a substring in R when the pattern is not that clear

I started learning R a week ago and I've been working on extracting information from HTML pages to get started.
I know this is a frequent and basic question, because I've already asked it in a different context and I've read quite a few threads.
I also know which functions I could use: sub(), str_match(), etc.
I chose to use sub(), and here is what my code looks like for the time being:
# libraries
library('xml2')
library('rvest')
library('stringr')
# author page:
url <- paste('https://ideas.repec.org/e/', sample[4,3], '.html', sep = "")
url <- gsub(" ", "", url, fixed = TRUE)
webpage <- read_html(url)
# get all published articles:
list_articles <- html_text(html_nodes(webpage, '#articles-body ol > li'))
# get titles:
titles <- html_text(html_nodes(webpage, '#articles-body b a'))
# get co-authors:
authors <- sub(".* ([A-Za-z_]+),([0-9]+).\n.*", "\\1", list_articles)
Here is what an element of list_articles looks like:
" Theo Sparreboom & Lubna Shahnaz, 2007.\n\"Assessing Labour Market
Vulnerability among Young People,\"\nThe Pakistan Development
Review,\nPakistan Institute of Development Economics, vol. 46(3), pages 193-
213.\n"
When I try to get the co-authors, R gives me the whole string instead of just the co-authors (sub() returns its input unchanged when the pattern doesn't match), so I'm clearly specifying the pattern incorrectly, but I don't see why.
If someone could help me out, that would be great.
Hope you have a good day,
G. Gauthier

Is this helpful?
It extracts the string from the first uppercase letter up to (but not including) a comma followed by a space and a digit, using a lookahead.
library(stringr)
#get co-authors:
authors <- str_extract(list_articles,"[[:upper:]].*(?=, [[:digit:]])")
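For the sample element shown in the question, this should return just the author names:
str_extract(list_articles[1], "[[:upper:]].*(?=, [[:digit:]])")
#> [1] "Theo Sparreboom & Lubna Shahnaz"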

Related

How can you exclude certain words before periods from being used as sentence breaks in quanteda's corpus_reshape?

In some cases, certain periods are mistakenly used as sentence breaks when using corpus_reshape. I have a corpus from the pharmaceutical industry, and in many cases the period in "Dr." is mistakenly treated as a sentence break.
This post (Quanteda's corpus_reshape function: how not to break sentences after abbreviations (like "e.g.")) is similar but unfortunately does not solve the problem. Here is an example:
library("quanteda")
txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)
corpus(txt) %>%
  corpus_reshape(to = "sentences")
Corpus consisting of 4 documents.
d1.1 :
"With us we have Dr."
d1.2 :
"Smith."
d1.3 :
"We are not sure... where we stand."
d2.1 :
"The U.S. is south of Canada."
It gets only a few of the cases with "Dr." right. I was wondering whether words to exclude can be added to the function, because I would like to avoid switching to a different function to break the text into sentences. Thanks!
Please use corpus_segment() with pattern and valuetype = "regex".
You can find examples here:
https://quanteda.io/reference/corpus_segment.html
You may also use the use_docvars option.
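For this particular example, a minimal sketch of that approach might look like the following; the regex and the hard-coded abbreviation list ("Dr", "U.S") are assumptions you would need to extend for your own corpus, and it relies on the ICU regex engine's lookbehind support:
library("quanteda")
txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)
# Split after sentence-final punctuation followed by a space, but use
# negative lookbehinds so a period after a known abbreviation does not
# end a segment. (This sketch will still split after the ellipsis in d1;
# a production pattern needs more care.)
corpus(txt) %>%
  corpus_segment(
    pattern = "(?<!Dr)(?<!U\\.S)[.!?](?=\\s)",
    valuetype = "regex",
    extract_pattern = FALSE,
    pattern_position = "after"
  )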

How to scrape data from multiple wikipedia pages with R?

I am new to data scraping in R, but I would like to do the following. I have a list of celebrities, celebs, and I would like to grab their dates of birth from Wikipedia. I know how to do it for each individual celebrity, but I am trying to automate this process.
celebs <- c("Tom Hanks", "Tim Cook", "Michael Bloomberg")
I do the following to get the information I need for the first celebrity, Tom Hanks.
library(rvest)
wiki <- read_html("https://en.wikipedia.org/wiki/Tom_Hanks")
birth_date <- wiki %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
  html_text()
Is there a way to get the information I need for Tim Cook and Michael Bloomberg without manually editing the above code?
Welcome to SO.
To do any task repeatedly with code, you should always look to build a loop. Before you can build a loop, you should try to build a single iteration of it. You almost have that ready here, but there are a few missing steps.
First of all, we should generalize the code so that it works by simply switching the value of one variable taken from your vector of iterators (celebs).
person <- "Tom Hanks"
Now, using that, we need to create the wikipedia link through code. There are two things to consider here:
We need to prepend the base URL to the name of the person;
We should replace the space in "Tom Hanks" with an underscore.
We can do that with this code (str_replace_all() comes from stringr, so it needs to be loaded):
library(stringr)
link <- paste0("https://en.wikipedia.org/wiki/",
               str_replace_all(person, " ", "_"))
This creates the correct link, which we can use for the subsequent steps. Now it is just a question of iterating through the celebs vector. There are many ways to do this, but in R the most appropriate is sapply. For that, we create an anonymous function that takes a person's name as input, queries Wikipedia, and extracts the birth date, using the code that you have already written:
function(person) {
  link <- paste0("https://en.wikipedia.org/wiki/",
                 str_replace_all(person, " ", "_"))
  wiki <- read_html(link)
  birth_date <- wiki %>%
    html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
    html_text()
  return(birth_date)
}
You can now wrap an sapply structure around that:
birthdates <- sapply(celebs, function(person) {
  link <- paste0("https://en.wikipedia.org/wiki/",
                 str_replace_all(person, " ", "_"))
  wiki <- read_html(link)
  birth_date <- wiki %>%
    html_nodes(xpath = '//*[@id="mw-content-text"]/div/table/tbody/tr[3]/td/text()') %>%
    html_text()
  return(birth_date)
})
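Since celebs is a character vector, sapply() should return a named vector here, so an individual result can be looked up with, e.g., birthdates["Tom Hanks"]. (If the XPath ever matches more than one node on a page, sapply() will fall back to returning a list instead.)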

Looping over a text collection to extract subchapters

As a continuation of my example here, I'm now confronted with the problem that I want to extract subchapters from all documents in my document collection in R for further text mining. This is my sample data:
doc_title <- c("Example.docx", "AnotherExample.docx")
text <- c("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
1 Introduction
He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
1.1 Futher
The bedding was hardly able to cover it and seemed ready to slide off any moment.", "2.2 Futher Fuhter
'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.")
doc_corpus <- data.frame(doc_title, text)
This is the function to divide the text into subchapters:
divideInto_subchapters <- function(doc_corpus){
  corpus_text <- doc_corpus$text
  # Replace lines starting with N.N.N+ with space
  corpus_text <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", corpus_text, perl = TRUE)
  # Split into IDs and texts (str_match_all() needs stringr loaded)
  data <- str_match_all(corpus_text, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")
  # Get the chapter ID column
  chapter_id <- trimws(data[[1]][,2])
  # Get the text column
  text <- trimws(data[[1]][,3])
  # Create the target DF
  corpus <- data.frame(doc_title, chapter_id, text)
  return(corpus)
}
Now I want to loop over all elements in my doc_corpus and divide all plain text into subchapters. This is what I tried out so far:
subchapter_corpus <- data.frame()
for (i in 1:nrow(doc_corpus)) {
  temp_corpus <- divideInto_subchapters(doc_corpus[i])
  subchapter_corpus <- rbind(subchapter_corpus, temp_corpus)
}
Unfortunately, this returns an empty data frame. What am I getting wrong here? Any help is highly appreciated.
My expected output for the first df row looks like this:
doc_title <- c("Example.docx")
chapter_id <- c("1 Introduction")
text <- c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.")
chapter_one_df <- data.frame(doc_title, chapter_id, text)
So, for me the loop gave "subscript out of bounds" until I changed doc_corpus[i] to doc_corpus[i, ]. With that change, I do get one row in the resulting data frame.
However, it contains only chapter_id "2.2 Futher Fuhter"; it seems to be missing "1.1 Futher".
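For reference, here is the loop with that one change applied (doc_corpus[i, ] selects row i as a one-row data frame, whereas doc_corpus[i] selects the i-th column):
subchapter_corpus <- data.frame()
for (i in 1:nrow(doc_corpus)) {
  # pass one row (one document) at a time to the splitting function
  temp_corpus <- divideInto_subchapters(doc_corpus[i, ])
  subchapter_corpus <- rbind(subchapter_corpus, temp_corpus)
}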
If it's a matter of the regex, then man it would sure help if you commented what you were doing with it! :)
Feel free to comment and I'll amend my answer as needed till it's helpful. Not sure if that's how it works, but this is only my 3rd day of answering questions on SO.

A way to add space between two unicode characters

I am using R to do an analysis of tweets and would like to include emojis in my analysis. I have read useful resources and consulted the emoji dictionaries from both Jessica Peterka Bonetta and Kate Lyons. However, I am running into a problem when emojis appear right next to each other in tweets.
For example, if I use a tweet with multiple emojis that are spread out, I get the results I am looking for:
x <- iconv(x, from = "UTF8", to = "ASCII", sub = "byte")
x
x will return:
"Ummmm our plane <9c><88><8f> got delayed <9a><8f> and I<80><99>m kinda nervous <9f><98><96> but I<80><99>m on my way <9c><85> home <9f><8f> so that<80><99>s really exciting <80><8f> t<80>
Which when matching with Kate Lyons' emoji dictionary:
FindReplace(data = x, Var = "x", replaceData = emoticons, from="R_Encoding", to = "Name", exact = FALSE)
Will yield:
Ummmm our plane AIRPLANE got delayed WARNINGSIGN and I<80><99>m kinda nervous <9f><98><96> but I<80><99>m on my way WHITEHEAVYCHECKMARK home <9f><8f> so that<80><99>s really exciting DOUBLEEXCLAMATIONMARK t<80>
If there is a tweet with two emojis in a row, such as:
"Delayed\U0001f615\U0001f615\n.\n.\n.\n\n#flying #flight #travel #delayed #baltimore #january #flightdelay #travelproblems #bummer… "
Repeating the process with iconv from above will not work, because it will not match the encodings in the emoji dictionary. Therefore, I thought of adding a space between the two patterns, so that (\U0001f615\U0001f615) becomes (\U0001f615 \U0001f615); however, I am struggling to write a proper regular expression for this.
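In case it helps, here is one possible sketch with stringr; the code-point range below is an assumption that only roughly approximates the emoji blocks and would need to be extended:
library(stringr)
x <- "Delayed\U0001f615\U0001f615"
# Insert a space after any emoji that is immediately followed by another
# emoji; the lookahead does not consume the second emoji, so runs of
# three or more are also handled.
str_replace_all(x, "([\U0001F300-\U0001FAFF])(?=[\U0001F300-\U0001FAFF])", "\\1 ")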

word frequency scatterplot in R (words as labels)

I'm currently working on a paper comparing British MPs' roles in Parliament and their roles on twitter. I have collected twitter data (most importantly, the raw text) and speeches in Parliament from one MP and wish to do a scatterplot showing which words are common in both twitter and Parliament (top right hand corner) and which ones are not (bottom left hand corner). So, x-axis is word frequency in parliament, y-axis is word frequency on twitter.
So far, I have done all the work on this paper with R. I have ZERO experience with R, up until now I've only worked with STATA.
I tried adapting this code (http://is-r.tumblr.com/post/37975717466/text-analysis-made-too-easy-with-the-tm-package), but I just can't work it out. The main problem is that the person who wrote this code uses one text document and regular expressions to demarcate which text belongs on which axis. I however have two separate documents (I have saved them as .txt, corpi, or term-document-matrices) which should correspond to the separate axis.
I'm sorry that a novice such as myself is bothering you with this, and I will devote more time this year to learning the basics of R so that I could solve this problem by myself. However, this paper is due next Monday and I simply can't do so much backtracking right now to solve the problem.
I would be really grateful if you could help me,
thanks very much,
Nik
EDIT: I'll include the code I've written, even though it's not quite going in the right direction, so that I can offer a proper example of what I'm dealing with.
I have tried implementing is.R()'s approach by putting the text in question in a csv file, with a dummy variable to classify whether it is twitter text or speech text. I follow the approach, and at the end I even get a scatterplot; however, it plots the number (I think it is the position of the word in the dataset?) rather than the word. I think the problem might be that R is handling every line in the csv file as a separate text document.
# In Excel I built a csv dataset that contains all the text, one instance (single tweet / speech) per line, with an added dummy variable that indicates whether the text is a tweet or a speech ("istwitter", 1 = twitter).
comparison_watson.df <- read.csv(file="data/watson_combo.csv", stringsAsFactors = FALSE)
# now to make a text corpus out of the data frame
comparison_watson_corpus <- Corpus(DataframeSource(comparison_watson.df))
inspect(comparison_watson_corpus)
# now to make a term-document-matrix
comparison_watson_tdm <- TermDocumentMatrix(comparison_watson_corpus)
inspect(comparison_watson_tdm)
comparison_watson_tdm <- inspect(comparison_watson_tdm)
sort(colSums(comparison_watson_tdm))
table(colSums(comparison_watson_tdm))
termCountFrame_watson <- data.frame(Term = rownames(comparison_watson_tdm))
termCountFrame_watson$twitter <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 1, ])
termCountFrame_watson$speech <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 0, ])
head(termCountFrame_watson)
zp1 <- ggplot(termCountFrame_watson)
zp1 <- zp1 + geom_text(aes(x = twitter, y = speech, label = Term))
print(zp1)
library(tm)
txts <- c(twitter = "bla bla bla blah blah blub",
          speech = "bla bla bla bla bla bla blub blub")
corp <- Corpus(VectorSource(txts))
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- names(txts)
term.matrix <- as.data.frame(term.matrix)
library(ggplot2)
ggplot(term.matrix,
       aes_string(x = names(txts)[1],
                  y = names(txts)[2],
                  label = "rownames(term.matrix)")) +
  geom_text()
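A side note on the ggplot2 call: aes_string() is deprecated in recent ggplot2 releases in favour of tidy evaluation, e.g. aes(.data[[names(txts)[1]]], .data[[names(txts)[2]]], label = rownames(term.matrix)), but it still works for this purpose.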
You might also want to try out these two buddies:
library(wordcloud)
comparison.cloud(term.matrix)
commonality.cloud(term.matrix)
You are not posting a reproducible example, so I cannot give you code, only point you to resources. Text scraping and processing is a bit involved in R, but there are many guides. Check this and this. In the last steps you can get word counts.
In the example from One R Tip A Day you get the word list in d$word and the word frequency in d$freq.
