Extracting text between symbols, over multiple lines [duplicate] - r

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
I have XML files of parliament protocols, from which I want to extract all of the interruptions mentioned. The interruptions are marked by brackets, like this:
Text I don't care about.
(applause from the right)
Text I don't care about.
I was given this code, which seemed to work just fine:
library(XML)     # provides xmlParse() and xmlToList()
library(tibble)  # provides enframe()

files <- as.list(dir(pattern = ".xml"))
my_list <- lapply(files, function(x) xmlToList(xmlParse(x)))
my_list2 <- lapply(my_list, function(x) enframe(regmatches(x[["TEXT"]],
  gregexpr("(?=\\().*?(?<=\\))", x[["TEXT"]], perl = TRUE))[[1]]))
This way I got the (applause from the right), but I have now realised that this code apparently only matches text within a single line, and I have some interruptions spanning multiple lines (1 - 3), like this:
Text I don't care about.
(applause from the right and
from the left)
Text I don't care about.
If the interruption is in this format, I get no results. How do I have to change the gregexpr call to look within one line, but also across multiple lines, until the corresponding ")" is found? I've been trying \n but so far no luck.
Thanks in advance
Edit
To further explain myself: I am looking at several hundred protocols (each one has its own XML file), each with several hundred of these interruptions. So, more specifically, I am looking for a solution that extracts them all with the same code.
A solution close to the code I used before would be extra helpful, since I am still fairly new to R.

Here is one way.
Sample2 = "Text I don't care about.
(applause from the right and
from the left)
Text I don't care about."
sub(".*\\((.*?)\\).*", "\\1", Sample2)
[1] "applause from the right and\n from the left"

Related

Understanding the logic of R code [closed]

I am learning R through tutorials, but I have difficulty "reading" R code, which in turn makes it difficult to write R code. For example:
dir.create(file.path("testdir2","testdir3"), recursive = TRUE)
vs
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
While I know what these lines of code do, I cannot read or interpret the logic of each line, whether I read left to right or right to left. What strategies should I use when reading/writing R code?
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Don't let lines of code like this ruin writing R code for you
I'm going to be honest here. The code is bad. And for many reasons.
Not a lot of people can read a line like this and intuitively know what the output is.
The point is you should not write lines of code that you don't understand. This is not Excel, you do not have but 1 single line to fit everything within. You have a whole deliciously large script, an empty canvas. Use that space to break your code into smaller bits that make a beautiful mosaic piece of art! Let's dive in~
Dissecting the code: Data Frames
Reading a line of code is like looking at a face for familiar features. You can read left to right, middle to out, whatever -- as long as you can lock onto something that is familiar.
Okay you see data.combined. You know (hope) it has rows and columns... because it's data!
You spot a $ in the code and you know it has to be a data.frame. This is because only lists and data.frames (which are really just lists) allow you to subset columns using $ followed by the column name. Subsetting, by the way, just means looking at a portion of the whole. In R, subsetting data.frames and matrices can be done with single brackets [ ], within which you write [row, column]. Thus data.combined[1, 2] would give you the value in row 1 of column 2.
Now, if you knew that the name of column 2 was name, you could use data.combined[1, "name"] to get the same output as data.combined$name[1]. Look back at that code:
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Okay, so now we see our eyes should be locked on data.combined[SOMETHING IS IN HERE?!] and slowly be picking out data.combined[ ?ROW? , oh, the "name" column]. Cool.
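
A tiny illustration of that equivalence, using a made-up data frame:

# Two ways to reach the same cell
df <- data.frame(name = c("ann", "bob"), age = c(30, 40))
df[1, "name"]  # "ann" -- [row, column] subsetting
df$name[1]     # "ann" -- $ picks the column, [1] the position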
Finding those ROW values!
which(duplicated(as.character(data.combined$name)))
Anytime you see the which function, it is just giving you locations. An example: for the vector a = c(1,2,2,1), which(a == 1) would give you 1 and 4, the locations of the 1s in a.
Now duplicated is simple too. duplicated(a) (which is just duplicated(c(1,2,2,1))) will give you back FALSE FALSE TRUE TRUE. If we ran which(duplicated(a)), it would return 3 and 4. Now here is a secret you will learn: if you have TRUEs and FALSEs, you don't need to use the which function! So maybe which was unnecessary here. And as.character was unnecessary too, since duplicated works on numbers as well as strings.
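
All of that is easy to verify at the console:

a <- c(1, 2, 2, 1)
which(a == 1)         # 1 4  -> positions of the 1s
duplicated(a)         # FALSE FALSE TRUE TRUE
which(duplicated(a))  # 3 4
a[duplicated(a)]      # 2 1  -> logical subsetting, no which() needed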
What You Should Be Writing
Who am I to tell you how to write code? But here's my take.
Don't mix up ways of subsetting: use EITHER data.frame[,column] or data.frame$column...
The code could have been written a little bit more legibly as:
dupes <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]
or equally:
dupes <- duplicated(data.combined[,"name"])
dupe.names <- data.combined[dupes,"name"]
I know this was lengthy but I hope it helps.
An easier way to read any code is to break up its components.
dup.names <-
  as.character(
    data.combined[which(
      duplicated(
        as.character(
          data.combined$name
        )
      )
    ), "name"]
  )
For each of the functions - those parts with rounded brackets following them, e.g. as.character() - you can learn more about what they do and how they work by typing ?as.character in the console.
Square brackets [] are used to subset data frames, which are stored in your environment (if you're using R within RStudio, the box to the upper right contains your values as well as any defined functions). In this case, you can tell that data.combined is the name that has been given to such a data frame in this example (type ?data.frame to find out more about data frames).
"Unwrapping" long lines of code can be daunting at first. Start by breaking it down into parenthesis , brackets, and commas. Parenthesis directly tacked onto a word indicate a function, and any commas that lie within them (unless they are part of another nested function or bracket) separate arguments which contain parameters that modify the way the function behaves. We can reduce your 2nd line to an outer function as.character and its arguments:
dup.names <- as.character(argument_1)
Just from this, we know that dup.names will be assigned a value with the data type "character", derived from a single argument.
Two functions in the first line, file.path() and dir.create(), each contain a comma to denote two arguments. Arguments can either be a single value or be specified with an equals sign. In this case, the output of file.path() serves as argument #1 of dir.create().
file.path(argument_1,argument_2)
dir.create(argument_1,argument_2)
Brackets are a way of subsetting data frames, with the general notation dataframe_object[row, column]. Within your second line is a data frame object, data.combined. You know it's a data frame object because of the brackets directly attached to it, and knowing this tells you that any functions internal to those brackets are contributing to subsetting this data frame.
data.combined[row, column]
So from there, we can see that the internal functions within this bracket will produce an output that specifies the rows of data.combined that will contribute to the subset, and that only the column named "name" will be selected.
Use the help function to start to unpack these lines by discovering what each function does and what its arguments are.

replace all rare words from the text (substitute very large number of strings in a large text)

I have a large text and want to replace all the words that have low frequency with some marker, for example "^rare^". My document is 1.7 million lines, and after cleaning it up it has 482,932 unique words, out of which more than 400 thousand occur fewer than 6 times; these are the ones that I want to replace.
The couple of ways I know of take longer than is practical. For instance, I just tried mgsub from the qdap package.
test <- mgsub(rare, "<UNK>", smtxt$text)
Where rare is a vector of all the rare words and smtxt$text is the vector that holds all the text, one sentence per row.
R is still processing it.
I think this is expected, since each word is being checked against each sentence. For now I am resigned to forget about doing something like this. I would like to hear from others if there is another way, since I still have not looked into many options besides what I know: gsub and mgsub. I also tried turning the text into a corpus to see if it would process faster.
Thanks
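
One approach that is usually much faster in this situation (a sketch only, not benchmarked at this scale; rare, smtxt$text, and the "<UNK>" marker are taken from the question) is to split the text into words once and replace by set membership, so each word is a single hash lookup instead of 400,000 patterns being scanned against every sentence:

# Split each sentence into words, flag rare ones via %in% (a hashed
# lookup), then rebuild the sentences
words <- strsplit(smtxt$text, " ", fixed = TRUE)
smtxt$text <- vapply(words, function(w) {
  w[w %in% rare] <- "<UNK>"
  paste(w, collapse = " ")
}, character(1))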

How to check if a paragraph is part of a text in R

I have one paragraph of text (a vector of words) and I would like to see if it is "part" of a long text (a vector of words). However, I know that this paragraph does not appear in the text in its exact form, but with slight changes: a few words could be missing, the order could be slightly different, some words could be inserted as parenthetical elements, etc.
I am currently implementing solutions "by hand", such as checking whether most of the words of the paragraph are in the text, looking at the distance between these words, their order, etc.
I was wondering, however, whether there is a built-in method to do that?
I already checked the tm package, but it does not seem to do that...
Any idea?
I fear that you are stuck with hand-writing an approach, e.g. grep-ing some word groups and having some kind of matching threshold.
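
For example, a crude version of that threshold idea (a sketch; paragraph and long_text stand for the two word vectors described in the question):

# Fraction of the paragraph's words that occur anywhere in the long text
overlap <- mean(paragraph %in% long_text)
# Call it "part of" the text if, say, 80% of the words are present
overlap >= 0.8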

Read only first few lines of CSV (or any text file) [duplicate]

This question already has answers here:
How to read first 1000 lines of .csv file into R? [closed]
(2 answers)
I often come across larger datasets where I don't know what they actually contain.
Instead of waiting for them to be opened in a conventional text editor or within the RStudio text editor, for example, I would like to look at the first few lines.
I don't even have to parse the contents; scanning these first few lines will help me to determine what method to use.
Is there a function/package for this?
read.table has an nrows option:
nrows: integer: the maximum number of rows to read in. Negative and
other invalid values are ignored.
so read in a few and see what you've got.
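For example (file.csv is a placeholder name):

# Read only the first 10 rows instead of the whole file
preview <- read.csv("file.csv", nrows = 10)
str(preview)  # inspect column names and types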
If you have a Unix environment then the command head file.csv will show the first ten lines. There are lots of other useful Unix commands (guess what tail file.csv does) and even if you are on Windows you could benefit from installing Cygwin and learning it!
Here is an answer to your question:
How to read first 1000 lines of .csv file into R?
Basically use nrows in read.csv or read.table...

Reading a specific line from a huge file *fast* [duplicate]

This question already has an answer here:
Efficiently reading specific lines from large files into R
(1 answer)
I have a huge comma-delimited file (1.5 Gb) and want to read one particular line from the file in R.
I've seen (many) versions of this question many times, and all suggest something like
con = file(fileName)
open(con)
scan(con, what=list("character", "character"), skip=1000000, nlines=1, sep="\t", quiet=TRUE)
That works, but it's still extremely slow - we're talking between 20 and 30 seconds to read a single line!
Is there a faster way? Surely there must be a fast way to jump to a particular line...
Thanks a million!
Do you know anything else about the structure of your file?
If every line/row has exactly the same number of bytes then you could calculate the number of bytes and seek to the beginning of the line.
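As a sketch of that idea (the file name and byte counts are hypothetical; line_width must include the line terminator):

con <- file("big.csv", open = "rb")  # binary mode so seek() is reliable
line_width  <- 100                   # bytes per line, including the "\n"
target_line <- 1000000               # number of lines to skip
seek(con, where = line_width * target_line)
readLines(con, n = 1)                # read just the line of interest
close(con)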
However, if the number of bytes per line is not exactly the same for every line, then you need to read every character, check whether it is a newline (or carriage return, or both), and count those to find the line that you are looking for. This is what the skip argument to scan and friends does.
There may be other tools that are quicker at doing the reading and counting, which you could use to preprocess your file and return only the line of interest.
If you are going to do something like this multiple times, it could speed up the overall process to read the file into a different structure, such as a database, that can access arbitrary lines directly, or to pre-index the lines so that you can seek directly to the appropriate line.
