Replace Text in XML files with placeholder text - r

I performed text mining on files that I am preparing for publication right now. There are several XML files that contain text within segments (see basic example below). Due to copyright restrictions, I have to make sure that the files that I am going to publish do not contain the whole text while someone who has the texts should be able to 'reconstruct' the files. To make sure that one can still perform basic text mining (= count lengths), the segment length should not change. Therefore I am looking for a way to replace every word except for the first and last one in all segments with dummy / placeholder text.
Basic example:
Input:
<text>
<div>
<seg xml:id="A">Lorem ipsum dolor sit amet</seg>
<seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg>
</div>
</text>
Output:
<text>
<div>
<seg xml:id="A">Lorem blank blank blank amet</seg>
<seg xml:id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>

You can use rapply to recursively replace values in a nested list.
Let data.xml contain your input:
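library(tidyverse)
library(xml2)

read_xml("data.xml") %>%
  as_list() %>%
  rapply(how = "replace", function(x) {
    # split the segment text into words
    tokens <- str_split(x, " ")[[1]]
    n_tokens <- length(tokens)
    # segments of one or two words have nothing to mask
    if (n_tokens <= 2) return(x)
    # keep the first and last word, replace everything in between
    c(
      tokens[[1]],
      rep("blank", n_tokens - 2),
      tokens[[n_tokens]]
    ) %>%
      paste0(collapse = " ")
  }) %>%
  as_xml_document() %>%
  write_xml("data2.xml")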
Output file data2.xml (note that the xml: prefix of the id attributes does not survive this round trip):
<?xml version="1.0" encoding="UTF-8"?>
<text>
<div>
<seg id="A">Lorem blank blank blank amet</seg>
<seg id="B">sed blank blank blank blank invidunt</seg>
</div>
</text>
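
In the output above the attribute appears as id rather than xml:id. A possible alternative, sketched here under the assumption that only seg elements carry text, is to edit the nodes in place with xml2 instead of converting to a list. The helper name mask_inner_words and its placeholder argument are made up for this illustration; the XML is inlined so the snippet is self-contained.

```r
library(xml2)

# Hypothetical helper: keep the first and last word, mask the rest.
# Texts with two words or fewer are returned unchanged.
mask_inner_words <- function(text, placeholder = "blank") {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  n <- length(tokens)
  if (n <= 2) return(text)
  paste(c(tokens[1], rep(placeholder, n - 2), tokens[n]), collapse = " ")
}

doc <- read_xml('<text><div><seg xml:id="A">Lorem ipsum dolor sit amet</seg><seg xml:id="B">sed diam nonumy eirmod tempor invidunt</seg></div></text>')

# Select the segments and overwrite their text; attributes are not touched.
segs <- xml_find_all(doc, ".//seg")
xml_text(segs) <- vapply(xml_text(segs), mask_inner_words, character(1),
                         USE.NAMES = FALSE)

masked <- xml_text(segs)
```

Because only the text nodes are modified, the element attributes should be written back exactly as they were read. Call write_xml(doc, "data2.xml") afterwards to save the result.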

Related

xml in R, remove paragraphs but keep xml class

I am trying to remove some paragraphs from an XML document in R, but I want to keep the XML structure/class. Here's some example text and my failed attempts:
library(xml2)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
xml_find_all(text, './/caption//p') %>% xml_remove() # deletes text
xml_find_all(text, './/caption//p') %>% xml_text() # removes paragraphs but also XML structure
Here's what I would like to end up with (just the paragraphs in the caption removed):
ideal_text = read_xml("<paper> <caption>The main title A sub title</caption> <p>The opening paragraph.</p> </paper>")
ideal_text
It looks like this requires multiple steps. Find the node, copy the text, remove the contents of the node and then update.
library(xml2)
library(magrittr)
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
# find the caption
caption <- xml_find_all(text, './/caption')
# store the existing text
replacement <- caption %>% xml_find_all('.//p') %>% xml_text() %>% paste(collapse = " ")
# remove the paragraph nodes
caption %>% xml_find_all('.//p') %>% xml_remove()
# replace the caption text
xml_text(caption) <- replacement
text # test
{xml_document}
<paper>
[1] <caption>The main title A sub title</caption>
[2] <p>The opening paragraph.</p>
Most likely you will need to obtain the vector/list of caption nodes and then step through them one-by-one with a loop.
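
A minimal sketch of such a loop, assuming a document with more than one caption (the sample XML here is invented for the illustration and is not the asker's data):

```r
library(xml2)

# Invented sample with two captions, each holding its own paragraphs.
doc <- read_xml("<paper><caption><p>A</p> <p>B</p></caption><caption><p>C</p></caption><p>Body</p></paper>")

captions <- xml_find_all(doc, ".//caption")

for (i in seq_along(captions)) {
  caption <- captions[[i]]
  # collect the paragraph text of this caption only
  paragraphs <- xml_find_all(caption, "./p")
  merged <- paste(xml_text(paragraphs), collapse = " ")
  # drop the <p> nodes, then put the merged text back into the caption
  xml_remove(paragraphs)
  xml_text(caption) <- merged
}

caption_text <- xml_text(xml_find_all(doc, ".//caption"))
```

Handling each caption separately keeps the text of one caption from leaking into another, which is what would happen if the text of all captions were pasted together in a single step.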

How to extract text only from parent HTML node (excluding child node)?

I have a code:
<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
</div>
I am using this code to extract text:
html_nodes(messageNode, xpath=".//p") %>% html_text() %>% paste0(collapse="\n")
And getting the result:
"(22) where?"
But I need only "p" text, excluding text that could be inside "p" in child nodes. I have to get this text:
"where"
Is there any way to exclude child nodes while getting the text?
Mac OS 10.11.6 (15G31), RStudio Version 0.99.903, R version 3.3.1 (2016-06-21)
If you are sure the text you want always comes last you can use:
doc %>% html_nodes(xpath=".//p/text()[last()]") %>% xml_text(trim = TRUE)
Alternatively, you can use the following to select all "non-empty" strings:
doc %>% html_nodes(xpath=".//p/text()[normalize-space()]") %>% xml_text(trim = TRUE)
For more details on normalize-space() see https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space
A third option would be to use the xml2 package directly via:
doc %>% xml2::xml_find_chr(xpath="normalize-space(.//p/text())")
This will grab all the text from the direct text children of each <p> (which means it won't include text from sub-nodes that aren't "text emitters"):
library(xml2)
library(rvest)
library(purrr)
txt <- '<div class="activityBody postBody thing">
<p>
(22)
where?
</p>
<p>
stays
<b>disappears</b>
<a>disappears</a>
<span>disappears</span>
stays
</p>
</div>'
doc <- read_xml(txt)
html_nodes(doc, xpath="//p") %>%
map_chr(~paste0(html_text(html_nodes(., xpath="./text()"), trim=TRUE), collapse=" "))
## [1] "where?" "stays stays"
Unfortunately, that's pretty "lossy" (you lose <b>, <span>, etc.), but this or @Floo0's (also potentially lossy) solution may work sufficiently for you.
If you use the XML package you can actually edit nodes (i.e. delete node elements).

Regex to remove white space between tags in gsub R

How can I remove white space or tabs between tags without removing it from inside the tags? I tried gsub but didn't succeed:
gsub("(^>)\\s(^<)", "", x)
Given a string like :
"<div class=\"panel\">\n <div class=\"shortcode\">\n\t <div class=\"article-\"> text text text text </div> \n </div>\n </div>"
Desired output:
<div class=\"panel\"><div class=\"shortcode\"><div class=\"article-\"> text text text text </div></div></div>
You could try using a lookaround:
gsub("(?<=\\>)(\\s*)(?=\\<)", "", x, perl = TRUE)
## [1] "<div class=\"panel\"><div class=\"shortcode\"><div class=\"article-\"> text text text text </div></div></div>"
We can use the fact that the tags have \n between them giving particularly simple solutions:
1) If s is the input string then:
gsub("\\s*\n\\s*", "", s)
(If \t cannot appear within tags as is the case in the question then the pattern could alternately be written as " *[\n\t] *".)
2) Another way is:
paste(sapply(strsplit(s, "\n"), trimws), collapse = "")
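
The approaches above agree on the string from the question but differ when no newline separates the tags. A small base R comparison, where the function names and the single-line sample are made up for the illustration:

```r
# The string from the question, plus an invented sample without newlines.
with_newlines <- "<div class=\"panel\">\n <div class=\"shortcode\">\n\t <div class=\"article-\"> text text text text </div> \n </div>\n </div>"
single_line <- "<a> <b> x </b> </a>"

# Lookaround: remove whitespace only when it sits between ">" and "<".
strip_lookaround <- function(s) gsub("(?<=>)\\s+(?=<)", "", s, perl = TRUE)
# Newline-based: remove each newline together with surrounding whitespace.
strip_newlines <- function(s) gsub("\\s*\n\\s*", "", s)
# Line-based: split on newlines, trim each line, glue the lines together.
strip_by_lines <- function(s) paste(trimws(strsplit(s, "\n")[[1]]), collapse = "")

expected <- "<div class=\"panel\"><div class=\"shortcode\"><div class=\"article-\"> text text text text </div></div></div>"

results <- c(
  strip_lookaround(with_newlines),
  strip_newlines(with_newlines),
  strip_by_lines(with_newlines)
)

# Without newlines only the lookaround version changes anything.
lookaround_single <- strip_lookaround(single_line)
newline_single <- strip_newlines(single_line)
```

The newline-based versions are simpler to read, but they rely on every gap between tags containing a newline. The lookaround version depends only on the surrounding > and < characters.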

applying formatting to a character range in textflows in Flex 4.10 SDK

When using any version of the Flex 4.10 SDK, the following code applies the format to an entire paragraph instead of a specific character range.
https://issues.apache.org/jira/browse/FLEX-33791
<?xml version="1.0"?>
<s:Application xmlns:fx="http://ns.adobe.com/mxml/2009" xmlns:s="library://ns.adobe.com/flex/spark" creationComplete="OnCreationComplete(event)">
<s:TextArea width="100%" height="100%" id="txt" editable="true">
<s:content>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.<s:br/>
Vivamus eu erat ac est ullamcorper egestas eget nec mauris.<s:br/>
</s:content>
</s:TextArea>
<fx:Script><![CDATA[
import flashx.textLayout.edit.EditManager;
import flashx.textLayout.formats.TextLayoutFormat;
import mx.events.FlexEvent;
private function OnCreationComplete(event:FlexEvent):void
{
var objFormat:TextLayoutFormat = new TextLayoutFormat();
objFormat.backgroundColor = 0xB9CCFF;
txt.selectRange(5, 8);
var objManager:EditManager = txt.textFlow.interactionManager as EditManager;
objManager.applyFormat(objFormat, objFormat, objFormat);
}
]]></fx:Script>
</s:Application>
The three parameters for applyFormat are for three different ways the format can be applied.
The first parameter, "leafFormat" will get applied to LeafElement objects like SpanElement (or nodes, if you prefer to think of the XML that TLF generates) and will actually create a new leaf if the current (or supplied) SelectionState doesn't encompass an entire LeafElement.
The second parameter, "paragraphFormat" will get applied to the entire paragraph that the current (or supplied) SelectionState is a part of. So if I select only a few characters from a paragraph and then call applyFormat, passing in a background color for the "paragraphFormat" parameter, the entire paragraph will get the background color.
The third parameter, "containerFormat" I've never used and haven't really looked into at all. I would guess that it applies the format to the entire ContainerController object that helps lay out the text.
You can safely pass null (or completely different formats) for any of the three format parameters.
So, in short, I think to fix your problem you just change your function call to:
objManager.applyFormat(objFormat, null, null);

Qt how to break the LabelText of QInputDialog

Here is my example code:
QInputDialog* inDialog = new QInputDialog();
inDialog->setMaximumWidth(100);
inDialog->setLabelText(QString("long and very long......you can say very long"));
The input box is displayed really long (as long as the string). I expected a way to set word wrap for the label text, but it seems QInputDialog has no method for that!
What can I do now? Write my own InputDialog class? Oh no...!
I hope there is a better way for it!
I would do it myself, like this for example :
QString s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut" ;
QString wrapped ;
if(s.length()>35)
{
wrapped = s.left(15) + QString(".....") + s.right(15) ;
}
else
{
wrapped = s ;
}
inDialog->setLabelText(wrapped) ;
I'm just starting with Qt, so this may not be the best way to get what you want, but here's what I would do.
I would create my own custom input dialog which inherits QInputDialog. I would then override the setLabelText function to check whether the string is 100 characters or fewer.
If it is, you can go ahead and display it. If not, you can choose where to add your dots and remove words in order to bring the size down.
Once it's 100 characters or fewer, you can display it.
I will try and write an example when I get home if you would like.

Resources