I'm currently working on a paper comparing British MPs' roles in Parliament and their roles on Twitter. I have collected Twitter data (most importantly, the raw text) and speeches in Parliament from one MP, and I want to draw a scatterplot showing which words are common in both Twitter and Parliament (top right-hand corner) and which ones are not (bottom left-hand corner). So the x-axis is word frequency in Parliament and the y-axis is word frequency on Twitter.
So far, I have done all the work on this paper with R. I have ZERO experience with R; up until now I've only worked with STATA.
I tried adapting this code (http://is-r.tumblr.com/post/37975717466/text-analysis-made-too-easy-with-the-tm-package), but I just can't work it out. The main problem is that the person who wrote this code uses one text document and regular expressions to demarcate which text belongs on which axis. I, however, have two separate documents (I have saved them as .txt, corpora, or term-document matrices), which should correspond to the separate axes.
I'm sorry that a novice such as myself is bothering you with this, and I will devote more time this year to learning the basics of R so that I could solve this problem by myself. However, this paper is due next Monday and I simply can't do so much backtracking right now to solve the problem.
I would be really grateful if you could help me,
thanks very much,
Nik
EDIT: I'll put in the code that I've made, even though it's not quite in the right direction, but that way I can offer a proper example of what I'm dealing with.
I have tried implementing is.R()'s approach by putting the text in question in a CSV file, with a dummy variable to classify whether it is Twitter text or speech text. I follow the approach, and at the end I even get a scatterplot; however, it plots a number (I think it is the position of the word in the dataset?) rather than the word. I think the problem might be that R is handling every line in the CSV file as a separate text document.
# in excel i built a csv dataset that contains all the text, each instance (single tweet / speech) in one line, with an added dummy variable that clarifies whether the text is a tweet or a speech ("istweet", 1=twitter).
comparison_watson.df <- read.csv(file="data/watson_combo.csv", stringsAsFactors = FALSE)
# now to make a text corpus out of the data frame
comparison_watson_corpus <- Corpus(DataframeSource(comparison_watson.df))
inspect(comparison_watson_corpus)
# now to make a term-document-matrix
comparison_watson_tdm <-TermDocumentMatrix(comparison_watson_corpus)
inspect(comparison_watson_tdm)
comparison_watson_tdm <- inspect(comparison_watson_tdm)
sort(colSums(comparison_watson_tdm))
table(colSums(comparison_watson_tdm))
termCountFrame_watson <- data.frame(Term = rownames(comparison_watson_tdm))
termCountFrame_watson$twitter <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 1, ])
termCountFrame_watson$speech <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 0, ])
head(termCountFrame_watson)
zp1 <- ggplot(termCountFrame_watson)
zp1 <- zp1 + geom_text(aes(x = twitter, y = speech, label = Term))
print(zp1)
library(tm)
library(ggplot2)

# One text per source, named so the names can become column names
txts <- c(twitter = "bla bla bla blah blah blub",
          speech = "bla bla bla bla bla bla blub blub")
corp <- Corpus(VectorSource(txts))
# Rows are terms, columns are the two documents
term.matrix <- as.matrix(TermDocumentMatrix(corp))
colnames(term.matrix) <- names(txts)
# Keep the terms as a column so ggplot can use them as labels
term.df <- as.data.frame(term.matrix)
term.df$term <- rownames(term.df)
ggplot(term.df, aes(x = twitter, y = speech, label = term)) +
  geom_text()
You might also want to try out these two buddies:
library(wordcloud)
comparison.cloud(term.matrix)
commonality.cloud(term.matrix)
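If the text stays in one row per tweet or speech with a dummy column, as in the question, the per-source counts can also be built without tm. This is only a minimal base-R sketch: the data frame docs, its columns text and istweet, and the helper tokenize are made up for illustration, and the tokenizer is deliberately crude (lower-case, split on anything that is not a-z).

```r
# Hypothetical stand-in for the CSV described in the question
docs <- data.frame(
  text = c("bla bla blub", "blah bla", "bla blub blub", "bla bla"),
  istweet = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Lower-case the text and split it on anything that is not a letter
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z]+"))
  words[nzchar(words)]
}

tw_words <- tokenize(docs$text[docs$istweet == 1])
sp_words <- tokenize(docs$text[docs$istweet == 0])

# Count both sources against the same term list so missing terms become 0
terms <- sort(unique(c(tw_words, sp_words)))
freq <- data.frame(
  term = terms,
  twitter = as.integer(table(factor(tw_words, levels = terms))),
  speech = as.integer(table(factor(sp_words, levels = terms))),
  stringsAsFactors = FALSE
)
print(freq)
```

The resulting freq has one row per word, so it can go straight into ggplot(freq, aes(x = speech, y = twitter, label = term)) + geom_text().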
You are not posting a reproducible example, so I cannot give you code, only point you to resources. Text scraping and processing is a bit difficult with R, but there are many guides. Check this and this. In the last steps you can get word counts.
In the example from One R Tip A Day you get the word list at d$word and the word frequency at d$freq.
I'm currently facing quite a lot of challenges doing text analysis with R.
I have a table with the columns Date, Text and Likes.
I want to count how often a certain word occurs within the texts of a column (counted at most once per text) and how often it does not.
I want to plot the results like in this picture,
but I would like dots in different colors for "occurrence" and "non-occurrence" of the searched word, aggregated monthly on the y-axis, with likes on the x-axis.
It would be great if you could help me with this challenge.
Update: sample data is available here: https://drive.google.com/file/d/1IWqDoRFBTL8er8VmvisHDeB5uM3BGgJe/view?usp=sharing
It looks like there are several moving parts here, so let me outline the tasks I think you are looking for assistance with:
1. Determine if a word appears in text, row by row.
2. Plot this information.
3. Display the information by category, i.e. word found or not found.
4. Provide some sort of smoothed fit over the data.
You can accomplish the first task with your choice of pattern-matching function. grepl, for example, takes the pattern as its first argument and the text to search as its second, and returns TRUE or FALSE for each element. You may want to look into other parameters, such as case sensitivity, to ensure they match your needs. You'll want to store this result in another column, assuming you use ggplot. Then you can pass the data to ggplot and use the col aesthetic to have it separate out the categories for you.
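To make the case-sensitivity point concrete, here is a small sketch with made-up strings (the vector texts is not from the question). It compares a plain match, a case-insensitive match, and a whole-word match.

```r
texts <- c("Orange juice", "an orange", "banana", "orangery")

# Default: case-sensitive substring match
plain <- grepl("orange", texts)
# Ignore upper/lower case
no_case <- grepl("orange", texts, ignore.case = TRUE)
# Whole word only: \\b marks a word boundary (perl = TRUE for PCRE syntax)
whole_word <- grepl("\\borange\\b", texts, ignore.case = TRUE, perl = TRUE)

print(data.frame(texts, plain, no_case, whole_word))
```

Which variant is right depends on whether "Orange" and "orangery" should count as hits for the word you are searching.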
It doesn't appear that your data is readily available from your question. In the future, it generally helps if you can share some sample data. I have made my own sample which should be similar to what you describe. See the example code below.
library(tidyverse)  # loads ggplot2 and the %>% pipe

set.seed(5)
# 60 days of made-up data: one fruit name and one like count per day
data <- data.frame(Date = seq.Date(from = as.Date("2021-01-01"),
                                   to = as.Date("2021-03-01"),
                                   by = "day"),
                   fruit = sample(c("banana", "orange", "apple"), 60, replace = TRUE),
                   likes = runif(60, 100, 1000))
# Flag each row by whether the search word occurs in it
data$good_fruit <- ifelse(grepl("orange", data$fruit), "orange", "not orange")
data %>%
  ggplot() +
  geom_point(aes(Date, likes, col = good_fruit)) +
  geom_smooth(aes(Date, likes))
Since I threw together literally random data, there is not much of a pattern here, but I think this illustrates the general idea of what you wanted to show. If you want a more specific kind of aggregation, I would recommend performing that manipulation before passing the data to ggplot, but for a rough fit this should work.
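For the monthly aggregation mentioned in the question, one option is to summarise before plotting. The following is a minimal base-R sketch on made-up data; the data frame posts and the columns has_word and month are assumptions for illustration, and mean likes is just one possible summary.

```r
set.seed(5)
# Made-up posts: one text and one like count per day
posts <- data.frame(
  Date = seq.Date(as.Date("2021-01-01"), as.Date("2021-03-01"), by = "day"),
  Text = sample(c("Orange juice", "banana bread", "apple pie"), 60, replace = TRUE),
  Likes = round(runif(60, 100, 1000)),
  stringsAsFactors = FALSE
)

# TRUE where the search word occurs in the text, ignoring case
posts$has_word <- grepl("orange", posts$Text, ignore.case = TRUE)
# Year-month label used as the grouping key
posts$month <- format(posts$Date, "%Y-%m")

# Mean likes per month, separately for texts with and without the word
monthly <- aggregate(Likes ~ month + has_word, data = posts, FUN = mean)
print(monthly)
```

The monthly data frame then has one row per month and category, which can be passed to ggplot with col = has_word in the same way as above.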
As a continuation of my example here, I'm now confronted with the problem that I want to extract subchapters for all documents in my document collection in R for further text mining. This is my sample data:
doc_title <- c("Example.docx", "AnotherExample.docx")
text <- c("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
1 Introduction
He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
1.1 Futher
The bedding was hardly able to cover it and seemed ready to slide off any moment.", "2.2 Futher Fuhter
'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.")
doc_corpus <- data.frame(doc_title, text)
This is the function to divide the text into subchapters:
library(stringr)  # provides str_match_all

divideInto_subchapters <- function(doc_corpus){
  corpus_text <- doc_corpus$text
  # Replace lines starting with N.N.N+ with space
  corpus_text <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", corpus_text, perl=TRUE)
  # Split into IDs and Texts
  data <- str_match_all(corpus_text, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")
  # Get the chapter ID column
  chapter_id <- trimws(data[[1]][,2])
  # Get the text column
  text <- trimws(data[[1]][,3])
  # Create the target DF
  corpus <- data.frame(doc_title, chapter_id, text)
  return(corpus)
}
Now I want to loop over all elements in my doc_corpus and divide all plain text into subchapters. This is what I tried out so far:
subchapter_corpus <- data.frame()
for (i in 1:nrow(doc_corpus)) {
  temp_corpus <- divideInto_subchapters(doc_corpus[i])
  subchapter_corpus <- rbind(subchapter_corpus, temp_corpus)
}
Unfortunately, this returns an empty data frame. What am I getting wrong here? Any help is highly appreciated.
My expected output for the first df row looks like this:
doc_title <- c("Example.docx")
chapter_id <- (c("1 Introduction"))
text <- (c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections."))
chapter_one_df <- data.frame(doc_title, chapter_id, text)
So, for me the loop gave me "subscript out of bounds" until I changed doc_corpus[i] to doc_corpus[i, ]. With that change, I do get one row in the resulting data frame.
However, it's only chapter_id "2.2 Futher Fuhter". It seems to be missing "1.1 Futher".
If it's a matter of the regex, then man it would sure help if you commented what you were doing with it! :)
Feel free to comment and I'll amend my answer as needed till it's helpful. Not sure if that's how it works, but this is only my 3rd day of answering questions on SO.
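The difference between the two kinds of indexing can be seen on a cut-down data frame. This is only a sketch with placeholder texts, not the sample data from the question: a single index picks a column, while an index followed by a comma picks a row.

```r
doc_corpus <- data.frame(
  doc_title = c("Example.docx", "AnotherExample.docx"),
  text = c("first text", "second text"),
  stringsAsFactors = FALSE
)

# One index: a one-column data frame holding the first COLUMN (doc_title)
by_column <- doc_corpus[1]
# Index plus comma: a one-row data frame holding the first ROW
by_row <- doc_corpus[1, ]

print(names(by_column))         # only "doc_title", so there is no text column
print(is.null(by_column$text))  # TRUE: the function would get nothing to split
print(by_row$text)              # "first text"
```

It may also be worth checking the last step of the function: data.frame(doc_title, chapter_id, text) picks up the global doc_title vector, because the function never defines its own. Using doc_corpus$doc_title there would tie each result to the row that was passed in.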
I just started learning R but I already have my first problem. I want to display my data in a graph. My data is in an Excel sheet converted to a .csv file. But I have some chemical formulas like Fe2O3 in my data, and in the .csv all subscripts are gone. That doesn't look very nice. Is there any way to get the subscripts from the original Excel file into R?
I would really appreciate your help :)
Edit: My data contains 6 chemical formulas displayed on the x-axis, which all contain subscripts (e.g. Fe2O3, ZnCl2, CO2, ...), and numeric values displayed on the y-axis. The graph is a bar chart. I am not sure if there is a way to either change the numbers to subscripts in R or keep them prior to the import.
The graph looks like this. But I would like to have the numbers as subscripts:
I don't know of a way to bring the formatting from Excel into a CSV and then into R, unless you can write those subscripts with Unicode subscript characters.
Given that your list of chemicals is short, it's not much work to tweak the chemical names to help ggplot interpret them with subscripts. You'll want brackets around the numbers, plus tildes afterwards if there are more elements to include. Then we also tell scale_x_discrete to "parse" the labels and convert those symbols to formatting.
library(tibble)
library(ggplot2)

set.seed(42)
chem_df <- tibble(
  Chemicals =
    c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2"),
  Chemicals_parsed =
    c("AgNO[3]", "Al[2]~SiO[5]", "CO[2]", "Fe[2]~O[3]", "FeSO[4]", "ZnCl[2]"),
  Mean = rnorm(6, 50, 30))
ggplot(chem_df, aes(x = Chemicals_parsed, y = Mean)) + geom_col() +
  # Parse each axis label itself, so labels cannot be mismatched with bars
  scale_x_discrete(name = "Chemicals",
                   labels = function(x) parse(text = x))
To add to the excellent answer of @JonSpring, you can write a function which will convert strings like "Al2SiO5" to strings like "Al[2]~SiO[5]", so you don't have to make all the conversions manually:
library(stringr)
chem.form <- function(s){
  s <- str_replace_all(s,"([0-9]+)","[\\1]~")
  if(endsWith(s,"~")) s <- substr(s,1,nchar(s) - 1)
  s
}
Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- as.vector(sapply(Chemicals,chem.form))
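The same conversion can also be done on a whole vector at once with base R's gsub, which avoids the sapply call. This is a sketch of one possible variant; the name chem_to_plotmath is made up here, and it assumes every run of digits in the formula is meant as a subscript.

```r
# Wrap each run of digits in [ ] followed by ~, then drop a trailing ~
chem_to_plotmath <- function(s) {
  s <- gsub("([0-9]+)", "[\\1]~", s)
  sub("~$", "", s)
}

formulas <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
labels <- chem_to_plotmath(formulas)
print(labels)

# The results are valid plotmath, so parse() accepts them
label_expr <- parse(text = labels)
```

Formulas where a digit is not a subscript (a leading coefficient, for example) would need their own handling.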
In my Excel file, the names of two of my variables are 2B and 3B, which stand for doubles and triples in baseball. However, when using corrplot, they show up as X2B and X3B. I assume this is because it thinks I want to do multiplication. How would I go about fixing this?
I tried changing the box in excel from general format to text.
Any help would be much appreciated.
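For reference, the X prefix is added at import time rather than by corrplot: read.csv defaults to check.names = TRUE, which passes the headers through make.names() so that they are syntactically valid R names. A minimal sketch with made-up inline data (csv_text is not the real file):

```r
# Made-up CSV content with headers that start with a digit
csv_text <- "Player,2B,3B\nA,30,5\nB,22,1"

# Default: headers are made syntactically valid, so 2B becomes X2B
default_names <- names(read.csv(text = csv_text))
# check.names = FALSE keeps the headers exactly as written
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(kept_names)
```

Columns whose names start with a digit then have to be quoted with backticks when accessed by name, e.g. baseball$`2B`.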
EDIT:
I got this part figured out. So now I have:
baseball = read.csv(file="MultComp3.csv",row.names=1)
library(corrplot)
M <- cor(baseball)[1:16,1:16]
# Same display labels for columns and rows
stat_labels <- c("Age","Runs\nPer\nGame","Hits","Doubles","Triples",
                 "Home Runs","RBI","Stolen\nBases","Walks","Strike\nOuts",
                 "Batting\nAverage","On-Base\nPercentage","Slugging\nPercentage","OPS","OPS+","Total\nBases")
colnames(M) <- stat_labels
rownames(M) <- stat_labels
corrplot.mixed(M)
EDIT 2:
But now, I need to make the text smaller, because it comes out of the boxes.
I'm sorry if this has been covered but I can't find a comparable question that's helpful here or anywhere else that isn't much too complicated for my beginner self. I just started learning R and am trying a practice problem that is literally identical to the one I was working on from the textbook, just with Jane Austen instead of Melville. Right now I'm trying to establish the beginning and end of the text so I can get rid of the metadata and be left with just the text. Here is my code:
# import the file
text.v <- scan("data/plainText/austen.txt", what="character", sep="\n")
# find the first and last sentences
start.v <- which(text.v == "CHAPTER 1. The family of Dashwood")
end.v <- which(text.v == "live twenty years longer.")
When I run it and then enter start.v or end.v into the console, I get integer(0).
However, the comparable code with the Melville returns the proper values.
#load text file
text.v <- scan("data/plainText/melville.txt", what="character", sep="\n")
text.v #view whole book
text.v[1] #view first line of book, as separated by \n
#create "bookmarks" to show the beginning and end of text
start.v <- which(text.v == "CHAPTER 1. Loomings.")
end.v <- which(text.v == "orphan.")
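One way to narrow this down is to check whether the lines match exactly, since == only returns TRUE for an identical string, including any leading or trailing whitespace. The lines below are invented stand-ins, not the real contents of austen.txt; they only sketch how trimws() and grep() can help locate the actual lines.

```r
# Invented stand-in lines; the real file will differ
text.v <- c("Some header line",
            "CHAPTER 1",
            "The family of Dashwood had long been settled in Sussex.  ",
            "live twenty years longer. ")

# Exact comparison fails because of the trailing space
exact <- which(text.v == "live twenty years longer.")
# Trimming whitespace first makes the comparison succeed
trimmed <- which(trimws(text.v) == "live twenty years longer.")
# grep() finds lines that merely contain or start with a pattern
chapter_line <- grep("^CHAPTER 1", text.v)
family_line <- grep("family of Dashwood", text.v, fixed = TRUE)

print(list(exact = exact, trimmed = trimmed,
           chapter_line = chapter_line, family_line = family_line))
```

Printing the lines that grep() finds in the real file shows exactly what each one contains, which can then be used in the == comparison.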