A way to add a space between two Unicode characters - R

I am using R to do an analysis of tweets and would like to include emojis in my analysis. I have read useful resources and consulted the emoji dictionaries from both Jessica Peterka Bonetta and Kate Lyons. However, I am running into a problem when there are emojis right next to each other in tweets.
For example, if I use a tweet with multiple emojis that are spread out, I will get the results I am looking for:
x <- iconv(x, from = "UTF8", to = "ASCII", sub = "byte")
x
x will return:
"Ummmm our plane <9c><88><8f> got delayed <9a><8f> and I<80><99>m kinda nervous <9f><98><96> but I<80><99>m on my way <9c><85> home <9f><8f> so that<80><99>s really exciting <80><8f> t<80>
Which when matching with Kate Lyons' emoji dictionary:
FindReplace(data = x, Var = "x", replaceData = emoticons, from="R_Encoding", to = "Name", exact = FALSE)
Will yield:
Ummmm our plane AIRPLANE got delayed WARNINGSIGN and I<80><99>m kinda nervous <9f><98><96> but I<80><99>m on my way WHITEHEAVYCHECKMARK home <9f><8f> so that<80><99>s really exciting DOUBLEEXCLAMATIONMARK t<80>
If there is a tweet with two emojis in a row, such as:
"Delayed\U0001f615\U0001f615\n.\n.\n.\n\n#flying #flight #travel #delayed #baltimore #january #flightdelay #travelproblems #bummer… "
Repeating the iconv process from above will not work, because the result will not match the codings in the emoji dictionary. Therefore, I thought of adding a space between the two patterns, so that (\U0001f615\U0001f615) becomes (\U0001f615 \U0001f615), but I am struggling to write a proper regular expression for this.
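One option (a minimal sketch, not from the original post) is to insert the space with a regular expression before running iconv(). This assumes the emoji of interest fall in the Unicode "Symbol, other" class (\p{So}) and that R's PCRE engine has Unicode property support:
x <- "Delayed\U0001f615\U0001f615\n.\n.\n.\n\n#flying #flight #travel"
# Insert a space after any symbol character immediately followed by another symbol character
x <- gsub("(\\p{So})(?=\\p{So})", "\\1 ", x, perl = TRUE)
x
# "Delayed\U0001f615 \U0001f615\n.\n.\n.\n\n#flying #flight #travel"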

Related

R Split a title phrase into sub-phrases of a given maximum length

I have an R data frame where the columns have names such as the following:
"Goods excluding food purchased from stores and energy\nLast = 1.8"
"Books and reading material (excluding textbooks)\nLast = 136.1"
"Spectator entertainment (excluding video and audio subscription services)\nLast = -13.5"
There are a large number of columns. I want to insert newline characters where necessary, between words, so that these names consist of parts that are no longer than some given maximum, say MaxLen=18. And I want the last part, starting with the word "Last", to be on a separate line. In the three examples, the desired output is:
"Goods excluding\nfood purchased\nfrom stores and\nenergy\nLast = 1.8"
"Books and reading\nmaterial\n(excluding\ntextbooks)\nLast = 136.1"
"Spectator\nentertainment\n(excluding video\nand audio\nsubscription\nservices)\nLast = -13.5"
I have been trying to accomplish this with strsplit(), but without success. The parentheses and '=' sign may be part of my problem. The "\nLast = " portion is the same for all names.
Any suggestions much appreciated.
The strwrap function can help here, though you need to do a bit of work to keep the existing breaks. Consider this option:
input <- c("Goods excluding food purchased from stores and energy\nLast = 1.8",
"Books and reading material (excluding textbooks)\nLast = 136.1",
"Spectator entertainment (excluding video and audio subscription services)\nLast = -13.5")
strsplit(input, "\n") |>
lapply(function(s) unlist(sapply(s, strwrap, 18))) |>
sapply(paste, collapse="\n")
# [1] "Goods excluding\nfood purchased\nfrom stores and\nenergy\nLast = 1.8"
# [2] "Books and reading\nmaterial\n(excluding\ntextbooks)\nLast = 136.1"
# [3] "Spectator\nentertainment\n(excluding video\nand audio\nsubscription\nservices)\nLast = -13.5"
Here we split on the existing breaks, wrap each piece with strwrap to add new ones, then paste everything back together.
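If the goal is to relabel an actual data frame, the same pipeline can be wrapped in a small helper and applied to names(); the data frame name dat below is hypothetical:
wrap_names <- function(x, width = 18) {
  strsplit(x, "\n") |>
    lapply(function(s) unlist(sapply(s, strwrap, width))) |>
    sapply(paste, collapse = "\n")
}
# names(dat) <- wrap_names(names(dat))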

Extracting a substring in R when the pattern is not that clear

I started using R a week ago and have been working on extracting some information from HTML pages to get started.
I know this is a frequent and basic question, because I've already asked it in a different context and I read quite a few threads.
I also know the functions I could use: sub / str_match, etc.
I chose to use sub() and here is what my code looks like for the time being:
#libraries
library('xml2')
library('rvest')
library('stringr')
#author page:
url <- paste('https://ideas.repec.org/e/',sample[4,3],'.html',sep="")
url <- gsub(" ", "", url, fixed = TRUE)
webpage <- read_html(url)
#get all published articles:
list_articles <- html_text(html_nodes(webpage,'#articles-body ol > li'))
#get titles:
titles <- html_text(html_nodes(webpage, '#articles-body b a'))
#get co-authors:
authors <- sub(".* ([A-Za-z_]+),([0-9]+).\n.*","\\1", list_articles)
Here is what an element of list_articles looks like:
" Theo Sparreboom & Lubna Shahnaz, 2007.\n\"Assessing Labour Market
Vulnerability among Young People,\"\nThe Pakistan Development
Review,\nPakistan Institute of Development Economics, vol. 46(3), pages 193-
213.\n"
When I try to get the co-authors, R gives me the whole string instead of just the co-authors, so I'm clearly specifying the pattern incorrectly, but I don't get why.
If someone could help me out, that would be great.
Hope you have a good day,
G. Gauthier
Is this helpful?
It extracts the string from the first upper-case letter up to (but not including) a comma, a space and then a digit.
library(stringr)
#get co-authors:
authors <- str_extract(list_articles,"[[:upper:]].*(?=, [[:digit:]])")
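Applied to the sample element of list_articles above, this returns just the author names (a quick check, not part of the original answer; it works because "." does not match the newline, so the greedy ".*" stops right before ", 2007"):
s <- " Theo Sparreboom & Lubna Shahnaz, 2007.\n\"Assessing Labour Market Vulnerability among Young People,\""
str_extract(s, "[[:upper:]].*(?=, [[:digit:]])")
# [1] "Theo Sparreboom & Lubna Shahnaz"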

Retaining unique identifiers (e.g., record id) when using tm functions - doesn't work with lots of data?

I am working with unstructured text (Facebook) data and am pre-processing it (e.g., stripping punctuation, removing stop words, stemming). I need to retain the record (i.e., Facebook post) IDs while pre-processing. I have a solution that works on a subset of the data but fails with all the data (N = 127K posts). I have tried chunking the data, and that doesn't work either. I think it has something to do with me using a work-around and relying on row names. For example, it appears to work with the first ~15K posts, but when I keep subsetting, it fails. I realize my code is less than elegant, so I'm happy to learn better/completely different solutions - all I care about is keeping the IDs when I go to a VCorpus and then back again. I'm new to the tm package and the readTabular function in particular. (Note: I ran tolower and removeWords before making the VCorpus, as I originally thought that was part of the issue.)
Working code is below:
Sample data
fb = data.frame(
  RecordContent = c("I'm dating a celebrity! Skip to 2:02 if you, like me, don't care about the game.",
                    "Photo fails of this morning. Really Joe?",
                    "This piece has been almost two years in the making. Finally finished! I'm antsy for October to come around... >:)"),
  FromRecordId = c(682245468452447, 737891849554475, 453178808037464),
  stringsAsFactors = F)
Remove punctuation & make lower case
fb$RC = tolower(gsub("[[:punct:]]", "", fb$RecordContent))
fb$RC2 = removeWords(fb$RC, stopwords("english"))
Step 1: Create special reader function to retain record IDs
myReader = readTabular(mapping=list(content="RC2", id="FromRecordId"))
Step 2: Make my corpus. Read in the data using DataframeSource and the custom reader function where each FB post is a "document"
corpus.test = VCorpus(DataframeSource(fb), readerControl=list(reader=myReader))
Step 3: Clean and stem
corpus.test2 = corpus.test %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument, language = "english") %>%
  as.VCorpus()
Step 4: Make the corpus back into a character vector. The row names are now the IDs
fb2 = data.frame(unlist(sapply(corpus.test2, `[`, "content")), stringsAsFactors = F)
Step 5: Make new ID variable for later merge, name vars, and prep for merge back onto original dataset
fb2$ID = row.names(fb2)
fb2$RC.ID = gsub(".content", "", fb2$ID)
colnames(fb2)[1] = "RC.stem"
fb3 = select(fb2, RC.ID, RC.stem)
row.names(fb3) = NULL
I think the IDs are stored and retained by default by the tm package. You can fetch them all (in a vectorized manner) with:
meta(corpus.test, "id")
$`682245468452447`
[1] "682245468452447"
$`737891849554475`
[1] "737891849554475"
$`453178808037464`
[1] "453178808037464"
I'd recommend reading the documentation of the tm::meta() function, though it's not very good.
You can also add arbitrary metadata (as key-value pairs) to each collection item in the corpus, as well as collection-level metadata.
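For example, following the patterns shown in ?tm::meta (the tag names and values here are purely illustrative):
meta(corpus.test[[1]], tag = "comment") <- "first post"                   # one document
meta(corpus.test, tag = "batch") <- rep("jan-2018", length(corpus.test))  # indexed, one value per document
meta(corpus.test, tag = "creator", type = "corpus") <- "analysis team"    # corpus-level
meta(corpus.test[[1]])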

stri_replace_all_fixed slow on big data set - is there an alternative?

I'm trying to stem ~4000 documents in R by using the stri_replace_all_fixed function. However, it is VERY slow, since my dictionary of stemmed words consists of approx. 300k words. I am doing this because the documents are in Danish and therefore the Porter Stemmer Algorithm is not useful (it is too aggressive).
I have posted the code below. Does anyone know an alternative for doing this?
Logic: look at each word in each document -> if the word matches a word in the voc table, replace it with the corresponding tran word.
##Read in the dictionary
voc <- read.table("danish.csv", header = TRUE, sep=";")
#Using the library 'stringi' to make the stemming
library(stringi)
#Split the voc corpus and put the word and stem column into different corpus
word <- Corpus(VectorSource(voc))[1]
tran <- Corpus(VectorSource(voc))[2]
#Using stri_replace_all_fixed to stem words
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))
Structure of "voc" data frame:
Word Stem
1 abandonnere abandonner
2 abandonnerede abandonner
3 abandonnerende abandonner
...
313273 åsyns åsyn
To make dictionary matching fast, you need to implement a clever data structure such as a prefix tree (trie). Doing 300,000 separate search-and-replace passes just does not scale.
I don't think this can be made efficient in pure R; you would need to write a C or C++ extension. You have many tiny operations there, and the overhead of the R interpreter will kill you if you try to do this in pure R.
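If you do want to stay in R, one alternative sketch (not the answerer's C/C++ route) is to avoid the 300k sequential replace passes altogether: tokenize each document once and swap whole tokens via a single vectorized match() against the dictionary. The docs_text vector below is a hypothetical character version of the documents, and the column names follow the voc structure shown in the question:
library(stringi)
stem_one <- function(text, words, stems) {
  # Word-boundary split keeps whitespace and punctuation as tokens,
  # so pasting back with collapse = "" reconstructs the document
  tokens <- stri_split_boundaries(text, type = "word")[[1]]
  idx <- match(tokens, words)   # one vectorized lookup for all tokens
  hit <- !is.na(idx)
  tokens[hit] <- stems[idx[hit]]
  stri_c(tokens, collapse = "")
}
words <- as.character(voc$Word)
stems <- as.character(voc$Stem)
docs_stemmed <- vapply(docs_text, stem_one, character(1), words = words, stems = stems)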

word frequency scatterplot in R (words as labels)

I'm currently working on a paper comparing British MPs' roles in Parliament and their roles on twitter. I have collected twitter data (most importantly, the raw text) and speeches in Parliament from one MP and wish to do a scatterplot showing which words are common in both twitter and Parliament (top right hand corner) and which ones are not (bottom left hand corner). So, x-axis is word frequency in parliament, y-axis is word frequency on twitter.
So far, I have done all the work on this paper with R. I have ZERO experience with R; up until now I've only worked with Stata.
I tried adapting this code (http://is-r.tumblr.com/post/37975717466/text-analysis-made-too-easy-with-the-tm-package), but I just can't work it out. The main problem is that the person who wrote this code uses one text document and regular expressions to demarcate which text belongs on which axis. I, however, have two separate documents (I have saved them as .txt files, corpora, or term-document matrices) which should correspond to the separate axes.
I'm sorry that a novice such as myself is bothering you with this, and I will devote more time this year to learning the basics of R so that I could solve this problem by myself. However, this paper is due next Monday and I simply can't do so much backtracking right now to solve the problem.
I would be really grateful if you could help me,
thanks very much,
Nik
EDIT: I'll put in the code that I've written; even though it's not quite in the right direction, this way I can offer a proper example of what I'm dealing with.
I have tried implementing is.R()'s approach by putting the text in question in a CSV file, with a dummy variable to classify whether it is Twitter text or speech text. I follow the approach, and at the end I even get a scatterplot; however, it plots a number (I think it is the position of the word in the dataset?) rather than the word. I think the problem might be that R is handling every line in the CSV file as a separate text document.
# In Excel I built a csv dataset that contains all the text, one instance (single tweet / speech) per line, with an added dummy variable that indicates whether the text is a tweet or a speech ("istwitter", 1 = twitter).
comparison_watson.df <- read.csv(file="data/watson_combo.csv", stringsAsFactors = FALSE)
# now to make a text corpus out of the data frame
comparison_watson_corpus <- Corpus(DataframeSource(comparison_watson.df))
inspect(comparison_watson_corpus)
# now to make a term-document-matrix
comparison_watson_tdm <-TermDocumentMatrix(comparison_watson_corpus)
inspect(comparison_watson_tdm)
comparison_watson_tdm <- inspect(comparison_watson_tdm)
sort(colSums(comparison_watson_tdm))
table(colSums(comparison_watson_tdm))
termCountFrame_watson <- data.frame(Term = rownames(comparison_watson_tdm))
termCountFrame_watson$twitter <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 1, ])
termCountFrame_watson$speech <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 0, ])
head(termCountFrame_watson)
zp1 <- ggplot(termCountFrame_watson)
zp1 <- zp1 + geom_text(aes(x = twitter, y = speech, label = Term))
print(zp1)
library(tm)
txts <- c(twitter = "bla bla bla blah blah blub",
          speech = "bla bla bla bla bla bla blub blub")
corp <- Corpus(VectorSource(txts))
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- names(txts)
term.matrix <- as.data.frame(term.matrix)
library(ggplot2)
ggplot(term.matrix,
       aes_string(x = names(txts)[1],
                  y = names(txts)[2],
                  label = "rownames(term.matrix)")) +
  geom_text()
You might also want to try out these two buddies:
library(wordcloud)
comparison.cloud(term.matrix)
commonality.cloud(term.matrix)
You are not posting a reproducible example, so I cannot give you code but can only point you to resources. Text scraping and processing is a bit difficult with R, but there are many guides. Check this and this. In the last steps you can get word counts.
In the example from One R Tip A Day, you get the word list at d$word and the word frequencies at d$freq.
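For reference, the d$word / d$freq shape mentioned there can be built directly from the term matrix in the answer above (a small sketch, assuming the term.matrix object from that code):
d <- data.frame(word = rownames(term.matrix), freq = rowSums(term.matrix))
head(d[order(-d$freq), ])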

Resources