Separating txt (conversation) into columns with speaker names as variables - R

I'm new to text mining in R. I have multiple txt files of conversations between the same speakers organized as follows:
speaker one [speakers' names are on their own line]
what speaker one says [each speaker's speech is a paragraph on the line(s) after the name]
[empty line]
speaker two
what speaker two says
[empty line]
speaker one
what speaker one replies
[empty line]
speaker three
what speaker three says
...
I want to break up the texts into one row per text, with one column per speaker. Everything that speaker one says in a given text should be combined into one cell on that text's row, and the same for the other speakers. Something like this:

text   "speaker one"                "speaker two"                ...
text1  everything speaker one said  everything speaker two said
text2  everything speaker one said  everything speaker two said
...
Any help on how to get started would be appreciated.

Using some tidyverse packages you can get there. First read the text with readr::read_file, then split on the empty lines, and use readr::read_delim to read each chunk into a data.frame. As the data is now in a list, bind_rows will collapse all of it into one data.frame. bind_rows matches on the column names, so all the text of a speaker ends up in the correct column. Depending on which outcome you want, use either the first or the second solution.
I leave combining multiple text files mostly up to you; there is a sketch at the end of this answer.
library(readr)
library(tidyr)
library(dplyr)

# read the file into a single character string
text <- readr::read_file("conversation.txt")

# split the text on the empty lines (this assumes Windows "\r\n" line
# endings; for Unix-style files split on "\n\n" instead)
split_text <- strsplit(text, split = "\r\n\r\n")

# read each chunk back in with read_delim; this gives a list of
# data.frames, each with the speaker's name as its column name
list_text <- lapply(unlist(split_text), function(x) readr::read_delim(x, col_names = TRUE, delim = "\t"))

# bind_rows from dplyr combines everything into one tibble, matching on
# the column names
list_text %>%
  bind_rows()
# A tibble: 5 x 3
  `speaker one`                                                      `speaker two`         `speaker three`
  <chr>                                                              <chr>                 <chr>
1 what speaker one says is in this paragraph.                        NA                    NA
2 It might be in multiple lines, but not separated by an empty line. NA                    NA
3 NA                                                                 what speaker two says NA
4 what speaker one replies                                           NA                    NA
5 NA                                                                 NA                    what speaker three says.
Collapsing all the text into one row:
This needs a bit more work: first gather the data into a tidy long format, collapse the text per speaker, and then spread it wide again. Run the statements in chunks if you want to see what happens at each step.
list_text %>%
  bind_rows() %>%
  pivot_longer(everything(),
               names_to = "speakers",
               values_to = "text",
               values_drop_na = TRUE) %>%
  group_by(speakers) %>%
  summarise(text = paste0(text, collapse = " ")) %>%
  pivot_wider(names_from = speakers, values_from = text)
# A tibble: 1 x 3
  `speaker one`                                                                                   `speaker three`       `speaker two`
  <chr>                                                                                           <chr>                 <chr>
1 what speaker one says is in this paragraph. It might be in multiple lines, but not separated b~ what speaker three s~ what speaker two ~
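To combine multiple text files, here is a minimal sketch, assuming the transcripts sit in a hypothetical folder called transcripts/ and share the same "\r\n" line endings as above:

library(purrr)

# hypothetical folder of transcript files
files <- list.files("transcripts", pattern = "\\.txt$", full.names = TRUE)

read_conversation <- function(path) {
  text <- readr::read_file(path)
  chunks <- unlist(strsplit(text, split = "\r\n\r\n"))
  lapply(chunks, function(x) readr::read_delim(x, col_names = TRUE, delim = "\t")) %>%
    bind_rows() %>%
    pivot_longer(everything(), names_to = "speakers", values_to = "text",
                 values_drop_na = TRUE) %>%
    group_by(speakers) %>%
    summarise(text = paste0(text, collapse = " ")) %>%
    pivot_wider(names_from = speakers, values_from = text)
}

# one row per text file; row-binding matches on column names,
# so speakers missing from a file get NA
all_texts <- purrr::map_dfr(files, read_conversation)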
Text used in the file conversation.txt:
speaker one
what speaker one says is in this paragraph.
It might be in multiple lines, but not separated by an empty line.
speaker two
what speaker two says
speaker one
what speaker one replies
speaker three
what speaker three says.

Related

Extract text from CSV in R

I have an Excel .CSV file in which one column has the transcription of a conversation. Whenever the speaker uses Spanish, the Spanish is written within brackets.
One example sentence:
so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day
Ideally, I'd like to extract the English and Spanish separately, so one file would contain all the Spanish words, and another would contain all the English words.
Any ideas on how to do this? Or which function/package to use?
Edited to add: there are about 100 cells that contain text in this Excel sheet. I guess where I'm confused is: how do I treat this entire CSV as a "string"?
I don't want to copy and paste every cell as a "string" -- I was hoping I could somehow just upload the entire CSV.
To load the CSV into R, you could use readr::read_csv("YOUR_FILE.csv"). There are more options, some of which are available to you if you use the "File -- Import Dataset -- From Text (readr)" menu option in RStudio.
Supposing you have the data loaded, you will likely need to rely on some form of "regex" to parse the text into sections based on the brackets. There are some base R functions for this, but I find the functions in stringr (part of the tidyverse meta-package) to be useful for this. And tidyr::separate_rows is a nice way to split the text into more lines.
In the regex below, there are a few ingredients:
(?=...) means to split before the [ but to keep it.
\\[ is how we refer to [ because brackets have special meaning in regex so we need to "escape" them to treat them as a literal character.
(?<=...) means to split after the ] but keep it.
| in the final mutate() means "or"
(Granted, I'm still a regex beginner, so I expect there are more concise ways to do this.)
So we could do something like:
library(tidyverse)

df1 <- data.frame(text = "so [usualmente] maybe [me levanto como a las nueve y media] like I exercise and the I like either go to class online or in person like it depends on the day")

df1 %>%
  mutate(orig_row = row_number()) %>%
  separate_rows(text, sep = "(?=\\[)") %>%
  separate_rows(text, sep = "(?<=\\] )") %>%
  mutate(language = if_else(str_detect(text, "\\[|\\]"), "Espanol", "English"),
         text = str_remove_all(text, "\\[|\\]"))
Result
# A tibble: 5 × 3
text orig_row language
<chr> <int> <chr>
1 "so " 1 English
2 "usualmente " 1 Espanol
3 "maybe " 1 English
4 "me levanto como a las nueve y media " 1 Espanol
5 "like I exercise and the I like either go to class online or in person like it depends on the day" 1 English

regex: extract segments of a string containing a word, between symbols

Hello, I have a data frame that looks something like this:

dataframe <- data_frame(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
                                 'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))
print(dataframe)
# A tibble: 2 × 1
text
<chr>
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF
I want to extract the segment of each string containing the word "keep". Note that these segments can be separated from other parts by different symbols, for example , and ;.
The final dataset should look something like this:

final_dataframe <- data_frame(text = c('some words to keep',
                                       'other stuff to keep'))
print(final_dataframe)
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Does anyone know how I could do this?
With stringr ...
library(stringr)
library(dplyr)
dataframe %>%
  mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)
I've made great use of the positive lookbehind and positive lookahead group constructs -- check this out: https://regex101.com/r/Sc7h8O/1
If you want to assert that the text you're looking for comes after a character/group -- in your first case the apostrophe, use (?<=').
If you want to do the same but match something before ' then use (?=')
And you want to match between 0 and unlimited characters surrounding "keep" so use .* on either side, and you wind up with (?<=').*keep.*(?=')
I did find in my test that a string like text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', will also match the c(, which I didn't intend. But I assume your strings are all captured by pairs of apostrophes.
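A quick sketch of the idea with stringr, on a made-up string that really does contain apostrophes:

library(stringr)

x <- "junk 'some words to keep' more junk"
str_extract(x, "(?<=').*keep.*(?=')")
# [1] "some words to keep"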

R: reading a CSV file with different separators

I want to read a CSV file where the first line looks like this:
GeoFIPS,GeoName,Region,TableName,LineCode,IndustryClassification,Description,Unit,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
The following lines look like this (just one line of 500 shown):
"01000","Alabama",5,SAGDP1,2,"...","Chain-type quantity indexes for real GDP","Quantity index",77.435,80.198,83.178,84.532,84.232,86.440,88.581,94.298,97.490,99.348,99.971,99.317,95.894,98.103,99.525,100.000,101.212,100.544,101.541,102.664,103.827,106.164,107.652
How can I read this file so that all the values are separated correctly?
With the command gdp_data <- read.csv("GDP.csv", sep = ","), only the headers get separated correctly; the text of the following lines is all put into the first column.
Thank you very much for your answer.
Your problem appears to be that some of your columns are unnamed, rather than there being "different separators". (I don't think there are any different separators.)
Assuming you know in advance how many columns there are, then something like this should work.
readr::read_csv(
"<your file name>",
# Provide custom file names
col_names=c("GeoFIPS","GeoName","Region","TableName", paste0("X", 5:7))
) %>%
# Remove first row of "column names"
filter(row_number() > 1)
# A tibble: 1 x 7
GeoFIPS GeoName Region TableName X5 X6 X7
<chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 00000 United States dummy SAGDP1 1 ... Real GDP (millions of chained 2012 dollars)
You could, of course, provide more meaningful names for the unnamed columns.
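Alternatively, since the header line in the question does list every field, you could spell all the names out yourself. A sketch, assuming the file has exactly the eight text columns plus the years 1997-2019 shown above:

library(readr)

gdp_names <- c("GeoFIPS", "GeoName", "Region", "TableName", "LineCode",
               "IndustryClassification", "Description", "Unit", 1997:2019)

# skip = 1 drops the original header line
gdp_data <- read_csv("GDP.csv", col_names = gdp_names, skip = 1)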

R text mining: Create document term matrix from dataframe, convert to dataframe, retain columns from original dataframe

Thanks to lawyeR for recommending the tidytext package. Here is some code based on that package that seems to work pretty well on my sample data. It doesn't work quite so well, though, when the value of the text column is blank. (There are times when this will happen, and it will make sense to keep the blank rather than filtering it out.) I've set the first observation for TVAR to a blank to illustrate. The code drops this observation. How can I get R to keep the observation and set the frequencies for each word to zero? I tried some ifelse statements, both with and without the pipe, but it's not working so well. The trouble seems to center on the unnest_tokens function from the tidytext package.
sampletxt$TVAR[1] <- ""

chunk_words <- sampletxt %>%
  group_by(PTNO, DATE, TYPE) %>%
  unnest_tokens(word, TVAR, to_lower = FALSE) %>%
  count(word) %>%
  spread(word, n, 0)
I have an R data frame. I want to use it to create a document term matrix. Presumably I would want to use the tm package to do that but there might be other ways. I then want to convert that matrix back to a data frame. I want the final data frame to contain identifying variables from the original data frame.
Question is, how do I do that? I found some answers to a similar question, but that was for a data frame with text and a single ID variable. My data are likely to have about half a dozen variables that identify a given text record. So far, attempts to scale up the solution for a single ID variable haven't proven all that successful.
Below are some sample data. I created these for another task which I did manage to solve.
How can I get a version of this data frame that has an additional frequency column for each word in the text entries and that retains variables like PTNO, DATE, and TYPE?
sampletxt <-
structure(
list(
PTNO = c(1, 2, 2, 3, 3),
DATE = structure(c(16801, 16436, 16436, 16832, 16845), class = "Date"),
TYPE = c(
"Progress note",
"Progress note",
"CAT scan",
"Progress note",
"Progress note"
),
TVAR = c(
"This sentence contains the word metastatic and the word Breast plus the phrase denies any symptoms referable to.",
"This sentence contains tnm code T-1, N-O, M-0. This sentence contains contains tnm code T-1, N-O, M-1. This sentence contains tnm code T1N0M0. This sentence contains contains tnm code T1NOM1. This sentence is a sentence!?!?",
"This sentence contains Dr. Seuss and no target words. This sentence contains Ms. Mary J. blige and no target words.",
"This sentence contains the term stageIV and the word Breast. This sentence contains no target words.",
"This sentence contains the word breast and the term metastatic. This sentence contains the word breast and the term stage IV."
)), .Names = c("PTNO", "DATE", "TYPE", "TVAR"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
The quanteda package is faster and more straightforward than tm, and works nicely with tidytext as well. Here's how to do it:
These operations create a corpus from your object, create a document-feature matrix, and then return a data.frame that combines the variables with the feature counts. (Additional options are available when creating the dfm, see ?dfm).
library("quanteda")
samplecorp <- corpus(sampletxt, text_field = "TVAR")
sampledfm <- dfm(samplecorp)
result <- cbind(docvars(sampledfm), as.data.frame(sampledfm))
You can then group by the variables to get your result. (Here I am showing just the first 6 columns.)
dplyr::group_by(result[, 1:6], PTNO, DATE, TYPE)
# # A tibble: 5 x 6
# # Groups: PTNO, DATE, TYPE [5]
# PTNO DATE TYPE this sentence contains
# * <dbl> <date> <chr> <dbl> <dbl> <dbl>
# 1 1 2016-01-01 Progress note 1 1 1
# 2 2 2015-01-01 Progress note 5 6 6
# 3 2 2015-01-01 CAT scan 2 2 2
# 4 3 2016-02-01 Progress note 2 2 2
# 5 3 2016-02-14 Progress note 2 2 2
packageVersion("quanteda")
# [1] ‘0.99.6’
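Since the question mentions tidytext: a dfm can also be converted to tidytext's long format (one row per document-term pair), which can be easier to join back onto the identifying variables. A brief sketch using the tidy() method that tidytext provides for dfm objects:

library(tidytext)

# returns a tibble with columns document, term, and count
tidy(sampledfm)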
This approach, using the ngram_tokenize function from the "SentimentAnalysis" package together with tm, is the easiest way to do this, especially if you are trying to convert a column of a data frame to a DTM (though it also works on a txt file):
library("SentimentAnalysis")
corpus <- VCorpus(VectorSource(df$column_or_txt))
tdm <- TermDocumentMatrix(corpus,
control=list(wordLengths=c(1,Inf),
tokenize=function(x) ngram_tokenize(x, char=FALSE,
ngmin=1, ngmax=2)))
It is simple and works like a charm each time, in both Chinese and English, for those doing text mining in Chinese.
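If you then need an ordinary data.frame with documents as rows, a common sketch is to densify the sparse matrix and transpose it (fine for small corpora, memory-hungry for large ones). Assuming the corpus above was built from the question's sampletxt$TVAR:

# terms are rows in a TDM, so transpose to get documents as rows
dtm_df <- as.data.frame(t(as.matrix(tdm)))

# re-attach the identifying variables from the original data frame
result <- cbind(sampletxt[, c("PTNO", "DATE", "TYPE")], dtm_df)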

Find the most frequently occurring words in a text in R

Can someone help me with how to find the most frequently used two- and three-word phrases in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
data_frame(text = text) %>%
  unnest_tokens(word, text) %>% # split into words
  anti_join(stop_words) %>%     # take out "a", "an", "the", etc.
  count(word, sort = TRUE)      # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>% # tokenize bigrams and trigrams
  as_data_frame() %>%                                          # structure as a data frame
  count(value, sort = TRUE)                                    # count occurrences
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...
In Natural Language Processing, 2-word phrases are referred to as "bi-gram", and 3-word phrases are referred to as "tri-gram", and so forth. Generally, a given combination of n-words is called an "n-gram".
First, we install the ngram package (available on CRAN)
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases
library(ngram)
# To find all two-word phrases in the text "text":
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the text "text":
ng3 <- ngram(text, n = 3)
Finally, we can print the objects (n-grams) using various methods, as below:
print(ng2, output = "truncated")
print(ng3, output = "full")
get.phrasetable(ng2)
ngram::ngram_asweka(text, min = 2, max = 3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bi-gram)
lnth = 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (tri-gram)
lnth = 3
babble(ng = ng3, genlen = lnth)
We can split the words and use table to summarize the frequency:
words <- strsplit(text, "[ ,.\\(\\)\"]")
sort(table(words, exclude = ""), decreasing = TRUE)
Simplest?
require(quanteda)
# bi-grams (note: this uses an older quanteda interface; in current
# releases, build the n-grams first, e.g. dfm(tokens_ngrams(tokens(text), n = 2)))
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
## 2 2 2 2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
What it returns is an integer vector with the frequency count and the names of the vector correspond to the words that were counted.
gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words
