I'm stuck trying to extract, from a large text (around 17,000 documents), words that contain punctuation expressions. For example:
"...urine bag tubing and the vent jutting above the summit also strapped with the
white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The
aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A
cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This
prospective double blind,...[95] c(c(Introduction, Silicosis is a fibrotic"
I would like to extract words like the following:
[1] c(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>
[2] c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>
[3] c(PATIENTS & METHODS,
[4] c(c(Introduction
but not, for example, words like "cross-sectional", "2013.", "2)", or "(inability". This is the first step; my idea is to eventually get to this:
"...urine bag tubing and the vent jutting above the summit also strapped with the
white plaster tapeFigure 2), \n\n AIMS AND OBJECTIVES, The aim of this
study is to ... MATERIALS AND METHODS, A cross-sectional study with a ...
surgeries.n), \n\n PATIENTS AND METHODS, This prospective double blind,...
[95] Introduction Silicosis is a fibrotic"
To extract these words without grabbing every word that includes punctuation (like "surgeries.n)"), I have noticed that they always start with or include the expression "c(". But I had some trouble with the regex:
grep("c(", test)
Error in grep("c(", test) :
  invalid regular expression 'c(', reason 'Missing ')''
I also tried:
grep("c\\(", test, value = T)
but it returns the whole text file (the pattern does match, so grep returns the entire matching element rather than just the matching word). I have also used str_match from the stringr package, but I don't seem to get the pattern (regex) right. Any recommendations?
If I understood your problem correctly (I'm not sure whether your second text is the expected output or just an intermediate step), I would go with gsub like this:
gsub("(c\\(|<\\/?sc>)","",text)
The regex (first parameter) will match c( or <sc> or </sc> and replace each with nothing, thus cleaning the text as you expect (again, assuming I understood your expectation correctly).
more on the regex involved:
(x|y) is the structure for an OR (alternation) condition
c\\( will match a literal c( anywhere in the text
<\\/?sc> will match <sc> or </sc>, as the ? after the / means it can occur 0 or 1 times, so it's optional.
The double \\ is there so that after the R interpreter has removed the first backslash, there is still a backslash telling the regex engine we want to match a literal ( and a literal /.
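Putting it together with the expected output above, a minimal sketch (assuming text holds the string from the question; the "&" to "AND" substitution is my addition to match the expected result):
cleaned <- gsub("(c\\(|<\\/?sc>)", "", text)  # strip c( and the <sc>/</sc> tags
cleaned <- gsub(" & ", " AND ", cleaned)      # "PATIENTS & METHODS" -> "PATIENTS AND METHODS"
cleaned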
Try this,
text <- "...urine bag tubing and the vent jutting above the summit also strapped with the white plaster tapeFigure 2), \n\nc(A<sc>IMS AND</sc> O<sc>BJECTIVES</sc>, The aim of this study is to ... c(M<sc>ATERIALS AND</sc> M<sc>ETHODS</sc>, A cross-sectional study with a ... surgeries.n), \n\nc(PATIENTS & METHODS, This prospective double blind,...[95] c(c(Introduction, Silicosis is a fibroticf"
require(stringr)
words <- str_split(text, " ")
words[[1]][grepl("c\\(", words[[1]])]
## [1] "\n\nc(A<sc>IMS" "c(M<sc>ATERIALS" "\n\nc(PATIENTS" "c(c(Introduction,"
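If you only need the matching substrings rather than all the split-out words, stringr's str_extract_all() can pull them directly (a sketch, assuming the same text as above):
# extract each maximal run of non-space characters containing "c("
# (returns the four c( tokens, without the leading "\n\n")
str_extract_all(text, "\\S*c\\(\\S*")[[1]]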
I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio, and used str_subset and str_extract_all to acquire the smaller attributes I need. The main goal of this project is to scrape case history narrative data. These are paragraphs of natural language, bounded by unique words that are standardized throughout all the individual documents. See below for a reproduced example.
Is there a way I can use those two unique words ("CASE HISTORY" and "INVESTIGATOR:") to bound the text I would like to extract? If not, what sort of approach can I take to extract the narrative data I need from each report?
text_data <- list("ES SPRINGFEILD POLICE DE FARRELL #789\n NOTIFIED DATE TIME OFFICER\nMARITAL STATUS: UNKNOWN\nIDENTIFIED BY: H. POIROT AT: SCENE DATE: 01/02/1895\nFINGERPRINTS TAKEN BY DATE\n YES NO OBIWAN KENOBI 01/02/1895\n
SPRINGFEILD\n CASE#: 012-345-678\n ABC NOTIFIED: ABC DATE:\n ABC OFFICER: NATURE:\nCASE HISTORY\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\nINVESTIGATOR: HERCULE POIROT \n")
Here is what the expected output would be.
output <- list("This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.")
Thanks so much for helping!
One quick approach would be to use gsub and regexes to replace everything up to and including CASE HISTORY ('^.*CASE HISTORY') and everything after INVESTIGATOR: ('INVESTIGATOR:.*') with nothing. What remains will be the text between those two matches.
gsub('INVESTIGATOR:.*', '', gsub('^.*CASE HISTORY', '', text_data))
[1] "\n This is a string. There are many strings like it, but this one is mine. To be more specific, this is string 456 out of 5000 strings. It’s a case narrative string and\n Case#: 012-345-678\n EXAMINER / INVESTIGATOR'S REPORT\n CITY AND COUNTY OF SPRINGFEILD - RECORD OF CASE\nit continues on another page. It’s 1 page but mostly but often more than 1, 2 even\n the next capitalized word, investigator with a colon, is a unique word where the string stops.\n"
After much deliberation I came to a solution I feel is worth sharing, so here we go:
# load the pipe and the helpers used below
library(magrittr)  # for %>%
library(stringr)   # for str_squish()
library(purrr)     # for map2()
# unlist text_data and collapse it into a single string
file_contents_unlist <-
  paste(unlist(text_data), collapse = " ")
# read lines, squish whitespace for good measure
file_contents_lines <-
  file_contents_unlist %>%
  readr::read_lines() %>%
  str_squish()
# Create indices into the lines of our text data based upon grepl()
# regex matches; be sure they match if scraping multiple chunks of data.
index_case_num_1 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
index_case_num_2 <- which(grepl("(Case#: \\d+[-]\\d+)", file_contents_lines))
# function basically states, "give me back whatever's in those indices"
pull_case_num <- function(index_case_num_1, index_case_num_2) {
  file_contents_lines[index_case_num_1:index_case_num_2]
}
# map2() to iterate over the start/end index pairs
case_nums <- map2(index_case_num_1,
                  index_case_num_2,
                  pull_case_num)
# transform to data frame
case_nums_df <- as.data.frame.character(case_nums)
# Repeat the pattern for other vectors as needed.
index_case_hist_1 <- which(grepl("CASE HISTORY", file_contents_lines))
index_case_hist_2 <- which(grepl("Case#: ", file_contents_lines))
pull_case_hist <- function(index_case_hist_1, index_case_hist_2) {
  file_contents_lines[index_case_hist_1:index_case_hist_2]
}
case_hist <- map2(index_case_hist_1,
                  index_case_hist_2,
                  pull_case_hist)
case_hist_df <- as.data.frame.character(case_hist)
# cbind() the data frames; also a good place to debug from.
cases_comp <- cbind(case_nums_df, case_hist_df)
Thanks all for responding. I hope this solution helps someone out there in the future. :)
I want to do some text mining analysis on my data collected from Facebook, but I am having problems with the special/non-English characters in the text. The data looks like:
doc_id  text
001     'ð˜ð—¶ð˜€ ð˜ð—µð—² ð˜€ð—²ð—®ð˜€
002     I expect a return to normalcy...That is Biden’s great
003     'I’m facing a prison sentence
What I want is to remove the words containing these "strange" characters. I tried to do this by using
str_replace_all(text, "[^[:alnum:]]", " ")
But this doesn't work in my case. Any ideas?
A general approach to this kind of task is to specify the characters you want to keep. It appears that [:alnum:] includes Greek letters and accented letters.
Maybe this regex is more appropriate:
str_remove_all(x, "[^[\\da-zA-Z ]]")
[1] ""
[1] "I expect a return to normalcyThat is Bidens great"
[1] "Im facing a prison sentence"
I just replaced the alphabetic shortcut with a-zA-Z, added a whitespace, and used the str_remove_all function instead. Add any characters you want to keep.
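Note that the question asks to remove the words containing the strange characters, not just the characters themselves. If that is the goal, one sketch is to drop every whitespace-delimited token that contains a non-ASCII character:
library(stringr)
# remove any token containing a character outside the ASCII range
str_remove_all(x, "\\S*[^\\x00-\\x7F]\\S*")
On doc 002 this would drop only "Biden’s" (because of the curly apostrophe) and keep the rest of the sentence.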
As an R newbie, I am trying to use quanteda to find instances where a certain word sequentially appears somewhere before another certain word in a sentence. To be more specific, I am looking for instances where the word "investors" is located somewhere before the word "shall" in a sentence, in a corpus consisting of an international treaty concluded between Morocco and Nigeria (the text can be found here: https://edit.wti.org/app.php/document/show/bde2bcf4-e20b-4d05-a3f1-5b9eb86d3b3b).
The problem is that sometimes there are multiple words between these two words. For instance, sometimes it is written as "investors and investments shall". I tried to apply similar solutions offered on this website. When I tried the solution on (Keyword in context (kwic) for skipgrams?) and ran the following code:
kwic(corpus_mar_nga, phrase("investors * shall"))
I get 0 observations, since this counts only instances where there is exactly one word between "investors" and "shall".
And when I followed another solution offered in (Is it possible to use `kwic` function to find words near to each other?) and ran the following code:
toks <- tokens(corpus_mar_nga)
toks_investors <- tokens_select(toks, "investors", window = 10)
kwic(toks_investors, "shall")
I get instances where "investors" also appears after "shall", and this changes the context fundamentally, since in that case the subject of the sentence is something different.
In the end, in addition to instances of "investors shall", I should also get, for example, instances reading "Investors, their investment and host state authorities shall", but I can't achieve that with the above code.
Could anyone offer me a solution on this issue?
Huge thanks in advance!
Good question. Here are two methods: one relying on regular expressions on the corpus text, and a second (as @Kohei_Watanabe suggests in the comment) using window in tokens_select().
First, create some sample text.
library("quanteda")
## Package version: 2.1.2
# sample text
txt <- c("The investors and their supporters shall do something.
Shall we tell the investors? Investors shall invest.
Shall someone else do something?")
Now reshape this into sentences, since your search occurs within a sentence.
# reshape to sentences
corp <- txt %>%
corpus() %>%
corpus_reshape(to = "sentences")
Method 1 uses regular expressions. We add a boundary (\\b) before "investors", and the .+ says one or more of any character in between "investors" and "shall". (This would not catch newlines, but corpus_reshape(x, to = "sentences") will remove them.)
# method 1: regular expressions
corp$flag <- stringi::stri_detect_regex(corp, "\\binvestors.+shall",
                                        case_insensitive = TRUE)
print(corpus_subset(corp, flag == TRUE), -1, -1)
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "The investors and their supporters shall do something."
##
## text1.2 :
## "Investors shall invest."
A second method applies tokens_select() with an asymmetric window, combined with kwic(). First we select all documents (which are sentences) containing "investors", discarding the tokens before it and keeping all tokens after it; 1000 tokens after should be enough. Then we apply kwic(), where we keep all context words but focus on the word "shall", which by definition must come after "investors", since that was the first word kept.
# method 2: tokens_select()
toks <- tokens(corp)
tokens_select(toks, "investors", window = c(0, 1000)) %>%
kwic("shall", window = 1000)
##
## [text1.1, 5] investors and their supporters | shall | do something.
## [text1.3, 2] Investors | shall | invest.
The choice depends on what suits your needs best.
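If you also need the hits as a plain table (for counting, for instance), the kwic object coerces to a data frame (a sketch following on from method 2):
hits <- tokens_select(toks, "investors", window = c(0, 1000)) %>%
  kwic("shall", window = 1000)
as.data.frame(hits)  # one row per match, with docname and context columns
nrow(hits)           # number of matching sentences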
In my data (which is text), there are abbreviations.
Is there a function or code that searches for abbreviations in text? For example, detecting 3-, 4-, or 5-capital-letter abbreviations and letting me count how often they occur.
Much appreciated!
detecting 3-4-5 capital letter abbreviations
You may use
\b[A-Z]{3,5}\b
Details:
\b - a word boundary
[A-Z]{3,5} - 3, 4 or 5 capital letters (use [[:upper:]] to match letters other than ASCII, too)
\b - a word boundary.
An R demo (leveraging the regex occurrence-count code from @TheComeOnMan):
abbrev_regex <- "\\b[A-Z]{3,5}\\b"
x <- "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH."
sum(gregexpr(abbrev_regex,x)[[1]] > 0)
## => [1] 3
regmatches(x, gregexpr(abbrev_regex, x))[[1]]
## => [1] "XYZ" "WXYZ" "VWXYZ"
You can use the regular expression [A-Z] to match any occurrence of a capital letter. If you want this pattern repeated 3 times, add the {3} quantifier ([A-Z]{3}). Consider using variables and a loop to get the job done for 3 to 5 repetitions.
I would like to calculate the number of lines spoken by different speakers from a text using R (it is a transcript of parliamentary speaking records). The basic text looks like:
MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that.
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.
MR. JOHN: Thank you
In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken by each speaker each time they speak in a document, such that the above text would result in:
MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1
Thanks for pointers using R!
You can split the string on ":" and then use table:
table(sapply(strsplit(x, ":"), "[[", 1))
# MR. JOHN MR. LEHMAN MS. SMITH
# 2 1 1
strsplit - splits the strings at ":" and returns a list
sapply with [[ - selects the first element of each split
table - gets the frequencies
Edit, following the OP's comment: you can save the transcripts in a text file and use readLines to read the text into R.
tt <- readLines("./tmp.txt")
Now, we'll have to find a pattern by which to filter this text for just those lines with the names of those who're speaking. I can think of two approaches based on what I saw in the transcript you linked.
Check for a : and then look behind the : to see if the preceding character is any of A-Z or [:punct:] (that is, whether the character occurring before the : is a capital letter or a punctuation mark; this is because some of them have a ) before the :).
You can use strsplit followed by sapply (as shown below)
Using strsplit:
# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
There are other approaches possible (using gsub, for example) or alternate patterns, but this should give you an idea of the approach. If the pattern should differ, then you should just change it to capture all required lines.
Of course, this assumes that there is no other line, for example, like this:
"Mr. Chariman, whatever (bla bla): It is not a problem"
Because our pattern will give TRUE for ):. If this happens in the text, you'll have to find a better pattern.
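A stricter alternative (a sketch, relying on the fact that speaker tags open the line with a capitalized MR./MS. identifier) is to anchor the pattern at the start of the line, which sidesteps the "):"-type false positives; adjust the name part if speakers can have multi-word names:
# keep only lines opening with a speaker tag like "MR. JOHN:" or "MS. SMITH:"
tt.f <- tt[grepl("^(MR|MS|MRS)\\.\\s+[A-Z]+:", tt)]
table(sub(":.*$", "", tt.f))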