R split text on empty line - r

I have a very long file that looks like this:
"Ach! Hans, Run!"
2RRGG
Enchantment
At the beginning of your upkeep, you may say "Ach! Hans, run! It's the . . ." and name a creature card. If you do, search your library for the named card, put it into play, then shuffle your library. That creature has haste. Remove it from the game at end of turn.
UNH-R
A Display of My Dark Power
Scheme
When you set this scheme in motion, until your next turn, whenever a player taps a land for mana, that player adds one mana to his or her mana pool of any type that land produced.
ARC-C
AErathi Berserker
2RRR
Creature -- Human Berserker
2/4
Rampage 3 (Whenever this creature becomes blocked, it gets +3/+3 until end of turn for each creature blocking it beyond the first.)
LE-U
AEther Adept
1UU
Creature -- Human Wizard
2/2
When AEther Adept enters the battlefield, return target creature to its owner's hand.
M11-C, M12-C, DDM-C
...
I'd like to load this file into a data.frame or vector "oracle", split on each empty line (actually a space and a newline), so that
oracle[1]
gives output like
"Ach! Hans, Run!" 2RRGG Enchantment At the beginning of your upkeep, you may say "Ach! Hans, run! It's the . . ." and name a creature card. If you do, search your library for the named card, put it into play, then shuffle your library. That creature has haste. Remove it from the game at end of turn. UNH-R
I've tried code like
oracle <- read.table(file = "All Sets.txt", quote = "", sep="\n")
as well as scan(), but
oracle[1]
gives very long, undesired output.
Thanks!

Try this, based on your edited question:
oracle <- readLines("BenYoung2.txt")
nvec <- length(oracle)
breaks <- which(! nzchar(oracle))
nbreaks <- length(breaks)
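# Add a sentinel break after the last line unless the file already ends with one
# (this also covers a file that contains no empty lines at all).
if (nbreaks == 0L || breaks[nbreaks] < nvec) {
  breaks <- c(breaks, nvec + 1L)
  nbreaks <- nbreaks + 1L
}
# Paste together the lines between consecutive breaks.
oracle <- mapply(function(a, b) paste(oracle[a:b], collapse = " "),
                 c(1L, 1L + breaks[-nbreaks]),
                 breaks - 1L)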
oracle[1]
# [1] "\"Ach! Hans, Run!\" 2RRGG Enchantment At the beginning of your upkeep, you may say \"Ach! Hans, run! It's the . . .\" and name a creature card. If you do, search your library for the named card, put it into play, then shuffle your library. That creature has haste. Remove it from the game at end of turn. UNH-R"
Edit: this works fine if the breaks are always truly empty lines. To also treat whitespace-only lines as breaks, use this line instead:
breaks <- which(grepl("^[[:space:]]*$", oracle))
This gives the same results when the lines are truly empty.

I think it's easiest to build a new variable that says which group the line belongs in, then group by that and call paste. In base R:
lines <- readLines(textConnection(txt))
i <- cumsum(lines == '')
by(lines, i, paste, collapse='\n')
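As a minimal sketch of the same grouping idea, assuming a small made-up input (the `lines` vector below is hypothetical sample data, not the original file), the separator lines can be dropped before pasting so that they do not end up inside the records. It uses `tapply()` rather than `by()` and treats whitespace-only lines as breaks:

```r
# Hypothetical sample input; the fourth element is a whitespace-only separator.
lines <- c("Card A", "1U", "Text of A.", " ", "Card B", "2RR", "Text of B.")

# TRUE for empty or whitespace-only lines.
is_break <- grepl("^[[:space:]]*$", lines)

# Every separator line starts a new group.
group <- cumsum(is_break)

# Paste the non-separator lines of each group into one string.
records <- as.vector(
  tapply(lines[!is_break], group[!is_break], paste, collapse = " ")
)
records[1]
```

With real data, `lines` would come from `readLines()` on the file.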

The most straightforward way to do that is to first split on a line break (i.e. \n), then throw away the empty lines.
text = "line1
line2
line3
"
split1 = unlist(strsplit(text, "\n"))
filter = split1[split1 != ""]
# [1] "line1" "line2" "line3"

Related

rvest object cloned with rlang::duplicate is not properly cloned

rvest doesn't seem to offer any way to extract text from parent object only (ignoring children). One workaround uses xml_remove(), which mutates the original object - all the way up the memory chain given R's default lazy evaluation.
I looked to rlang::duplicate(), which is documented as "modifying the copy leaves the original object intact", but the clone does not appear to be truly independent. For example:
require(rvest)
h = '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc = xml2::read_html(h)
x = html_node(doc, '#target')
html_text(x)
#> [1] "\ntext to extract\ntext to ignorethis too"
Now clone x, remove its children, and extract the text:
x2 = rlang::duplicate(x, shallow = FALSE)
children = html_children(x2)
xml2::xml_remove(children)
html_text(x2)
#> [1] "\ntext to extract\n"
That works as intended, however x has also been mutated:
html_text(x)
#> [1] "\ntext to extract\n"
Any suggestions as to why this happens and how to work around it? I do not want to start re-attaching children.
First of all, let me say that I think you can solve the issue without copying the data. I'm not an expert in XPath, but I think you can use it to select only the direct text children, ignoring text nested in other XML nodes. The following seems to do the trick without any copy (x defined as in your question):
html_text(html_elements(x, xpath = "text()"))
# [1] "\ntext to extract\n"
That being said, I also have an answer to the question on how to make a deep copy:
The problem is that rlang::duplicate() can only copy R data structures. However, rvest builds on xml2, and xml2 builds on the C library libxml2.
When you create the xml_node object in R, the corresponding data structure is created in libxml2. On the R side, there is basically just a pointer to the libxml2 object. So rlang::duplicate() will only create a copy of that pointer, but not of the underlying data. It cannot do so, because it has no access to it as it is in a different library (that rlang doesn't know of).
The easiest way to create a copy of the underlying data seems to be to serialize and deserialize the XML. I suspect this is not very efficient, though.
Example:
Read in the original data:
require(rvest)
h <- '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc <- xml2::read_html(h)
x <- html_node(doc, '#target')
Create two copies, one with rlang::duplicate() and one with xml2::xml_unserialize():
x1 <- rlang::duplicate(x, shallow = FALSE)
x2 <- xml2::xml_unserialize(xml2::xml_serialize(x, NULL))
Check that x and x1 are in fact identical, while x2 is a true copy (the memory locations you get will of course be different from the ones shown here):
x$doc
# <pointer: 0x0000023911334ea0>
x1$doc
# <pointer: 0x0000023911334ea0>
# --> same as x
x2$doc
# <pointer: 0x00000239113377d0>
# --> different to x
Test that everything works as intended:
children <- html_children(x2)
xml2::xml_remove(children)
html_text(x2)
# [1] "\n text to extract\n "
html_text(x)
# [1] "\n text to extract\n text to ignorethis too"
Another potential solution (maybe a more general approach) is to use the html_children() function to obtain the text of all the child nodes and then remove that from the full text.
require(rvest)
h = '<ul>
<li id="target">
text to extract
<ul><li>text to ignore</li><li>this too</li></ul>
</li>
</ul>'
doc = xml2::read_html(h)
x = html_node(doc, '#target')
fulltext <- html_text(x)
# [1] "\ntext to extract\ntext to ignorethis too"
# find the text in the child nodes
childtext <- html_children(x) %>% html_text()
# "text to ignorethis too"
# replace the child node text with an empty string
# (fixed = TRUE so the text is not interpreted as a regular expression)
gsub(childtext, "", fulltext, fixed = TRUE) %>% trimws()
# "text to extract"
# alternative using the text from the first child node
firstchild <- xml2::xml_child(x, search = 1) %>% xml2::xml_text()
gsub(paste0(firstchild, ".*"), "", fulltext)
Of course, if there are additional newline "\n" or formatting characters, the gsub() may break.

paste specific text to strings that do not have it

I would like to paste "miR" to strings that do not have "miR" already, and skipping those that have it.
paste("miR", ....)
in
c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
out
c("miR-26b", "miR-26a", "miR-1297", "miR-4465", "miR-26b", "miR-26a")
One way is to remove "miR-" from the beginning of each string with sub, if it is present, and then paste it onto every string regardless.
paste0("miR-", sub("^miR-","", x))
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
data
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
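An alternative that avoids regular expressions, shown here as a minimal sketch rather than as part of the answer above: test each string with `startsWith()` and only prepend the prefix where it is missing. The variable names `has_prefix` and `out` are made up for illustration.

```r
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")

# TRUE where the string already begins with "miR".
has_prefix <- startsWith(x, "miR")

# Keep prefixed strings as they are; prepend "miR-" to the rest.
out <- ifelse(has_prefix, x, paste0("miR-", x))
out
```

`startsWith()` is in base R from version 3.3.0 onwards.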
vec <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
sub("^(?!miR)(.*)$", "miR-\\1", vec, perl = TRUE)
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
If you want to learn more:
type ?sub into the R console
learn regex, in particular negative lookahead and capturing groups
I've used perl = TRUE because the negative lookahead (?!...) is only supported by the Perl-compatible regex engine; without it, sub() raises an error.

Remove a verb as a stopword

There are some words which are used sometimes as a verb and sometimes as other part of speech.
Example
A sentence with the meaning of the word as verb:
I blame myself for what happened
And a sentence with the meaning of word as noun:
For what happened the blame is yours
The word I want to detect is known to me; in the example above it is "blame". I would like to detect it and remove it as a stopword only when it is used as a verb.
Is there an easy way to do this?
You can install TreeTagger and then use the koRpus package in R to use TreeTagger from R. Install it in a location like e.g. C:\Treetagger.
I will first show how TreeTagger works so you understand what's going on in the actual solution further down in this answer:
Intro treetagger
library(koRpus)
your_sentences <- c("I blame myself for what happened",
"For what happened the blame is yours")
text.tagged <- treetag(file="I blame myself for what happened",
format="obj", treetagger="manual", lang="en",
TT.options = list(path="C:\\Treetagger", preset="en") )
text.tagged@TT.res[, 1:2]
# token tag
#1 I PP
#2 blame VVP
#3 myself PP
#4 for IN
#5 what WP
#6 happened VVD
The sentences have been analysed now and the "only thing left" is to remove those occurrences of "blame" that are a verb.
Solution
I'll do this sentence for sentence by creating a function that first tags the sentence, then checks for "bad words" like "blame" that are also a verb and finally removes them from the sentence:
remove_words <- function(sentence, badword = "blame") {
  tagged.text <- treetag(file = sentence, format = "obj", treetagger = "manual", lang = "en",
                         TT.options = list(path = "C:\\Treetagger", preset = "en"))
  # Check for bad words AND verb:
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
  redflag <- which(cond1 & cond2)
  # If no such case, return sentence as is. If so, then remove that word:
  if (length(redflag) == 0) return(sentence)
  splitsent <- strsplit(sentence, " ")[[1]]
  splitsent <- splitsent[-redflag]
  paste0(splitsent, collapse = " ")
}
lapply(your_sentences, remove_words)
# [[1]]
# [1] "I myself for what happened"
# [[2]]
# [1] "For what happened the blame is yours"
In Python it can be done as:
from nltk import pos_tag
s1 = "I blame myself for what happened"
pos_tag(s1.split())
It will give you the words with their tags.
You can do something like this in Python:
>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
Then add your filter to eliminate verbs, for instance.
Hope this is helpful !

Give a new variable value 0 or 1 based on the distance between two words in another variable

I am new to R. In my dataset, I have a variable called Reason. I want to create a new column called Price. If any of the following conditions is met:
word "Price" and word "High" are both mentioned in Reason and the distance between them is less than 6 words
word "Price" and word "expensive" are both mentioned in Reason and the distance between them is less than 6 words
word "Price" and word "increase" are both mentioned in Reason and the distance between them is less than 6 words
then Price = 1. Otherwise, Price = 0.
I found the following user defined function to get the distance between 2 words
distance <- function(string, term1, term2) {
words <- strsplit(string, "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
abs(indices[term1] - indices[term2])
}
but I don't know how to apply it to the whole column to get the expected results. I tried the following code, but it only gives me "logical(0)" as the result.
for (j in seq(Survey$Reason))
{
Survey$Price[[j]]<- distance(Survey$Reason[[j]], " price ", " high ") <=6
}
Any help is highly appreciated.
Thanks
Starting from your sample data:
survey <- structure(list(Reason = c("Their price are extremely high.", "Because my price was increased so much, I wouldn't want anyone else to have to deal with that.", "Just because the intial workings were fine, but after we realised it would affect our contract, it left a sour taste in our mouth.", "Problems with the repair", "They did not handle my complaint as well I would have liked.", "Bad service overall.")), .Names = "Reason", row.names = c(NA, 6L), class = "data.frame")
First, I updated your function to remove punctuation and directly return the result of your distance test:
distanceOK <- function(string, term1, term2, n = 6) {
  words <- strsplit(gsub("[[:punct:]]", "", string), "\\s")[[1]]
  indices <- 1:length(words)
  names(indices) <- words
  dist <- abs(indices[term1] - indices[term2])
  ifelse(is.na(dist) | dist > n, 0, 1)
}
Then we apply:
survey$Price <- sapply(survey$Reason, FUN=function(str) distanceOK(str, "price","high"))
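The call above only checks the pair "price"/"high". A minimal sketch of how all three conditions from the question might be combined follows; the helper `within_n`, the vector `targets` and the shortened `reasons` sample are hypothetical names introduced here. It assumes case-insensitive, exact word matching and the same "at most n words apart" test as distanceOK:

```r
# Hypothetical helper: 1L if both terms occur and some pair of their
# occurrences is at most n words apart, otherwise 0L.
within_n <- function(string, term1, term2, n = 6) {
  words <- tolower(strsplit(gsub("[[:punct:]]", "", string), "\\s+")[[1]])
  pos1 <- which(words == term1)
  pos2 <- which(words == term2)
  if (length(pos1) == 0 || length(pos2) == 0) return(0L)
  as.integer(min(abs(outer(pos1, pos2, "-"))) <= n)
}

# Assumed sample rows, taken from the data above.
reasons <- c("Their price are extremely high.",
             "Because my price was increased so much, I wouldn't want anyone else to have to deal with that.",
             "Bad service overall.")
targets <- c("high", "expensive", "increase")

# A row is flagged if "price" is close to any of the target words.
price_flag <- vapply(reasons, function(r) {
  max(vapply(targets, function(w) within_n(r, "price", w), integer(1)))
}, integer(1), USE.NAMES = FALSE)
price_flag
```

Because matching is exact, "increased" in the second row does not count as "increase"; matching word stems would need an extra step such as a regular expression or a stemmer.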

How to read unquoted extra \r with data.table::fread

The data I have to process has unquoted text with some additional \r characters. The files are big (500MB) and numerous (>600), and changing the export is not an option. The data might look like
A,B,C
blah,a,1
bloo,a\r,b
blee,c,d
How can this be handled with data.table's fread?
Is there a better R read CSV function for this, that's similarly performant?
Repro
library(data.table)
csv<-"A,B,C\r\n
blah,a,1\r\n
bloo,a\r,b\r\n
blee,c,d\r\n"
fread(csv)
Error in fread(csv) :
Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types from point 0:
bloo,a
Advanced repro
The simple repro might be too trivial to give a sense of scale...
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Naive approach
fread("sample.csv")
# Akrun's approach with needing text read first
fread(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#>Error in file.info(input) : file name conversion problem -- name too long?
# Julia's approach with needing text read first
readr::read_csv(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#> Error: C stack usage 48029706 is too close to the limit
Further to @dirk-eddelbuettel's and @nrussell's suggestions, one way of solving this is to pre-process the file. The processor could also be called within fread(), but here it is performed in separate steps:
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Remove errant `\r`'s with tr - shown here is the Windows R solution
shell("C:/Rtools/bin/tr.exe -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")
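Where an external tr is not available, the same pre-processing can be sketched in base R. This is an illustration under stated assumptions, not a benchmarked solution: the function name `strip_cr` and the temporary sample file are hypothetical. It copies the file in raw chunks and drops every carriage-return byte, which, like `tr -d '\r'`, also turns \r\n line endings into \n:

```r
# Hypothetical helper: copy a file while dropping every "\r" byte.
strip_cr <- function(input_path, output_path, chunk_size = 1e6) {
  in_conn <- file(input_path, "rb")
  on.exit(close(in_conn), add = TRUE)
  out_conn <- file(output_path, "wb")
  on.exit(close(out_conn), add = TRUE)
  cr <- as.raw(13L)  # byte value of "\r"
  repeat {
    chunk <- readBin(in_conn, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break  # end of file
    writeBin(chunk[chunk != cr], out_conn)
  }
  invisible(output_path)
}

# Small demonstration on a temporary file (assumed sample data).
src <- tempfile(fileext = ".csv")
dst <- tempfile(fileext = ".csv")
writeBin(charToRaw("A,B,C\r\nblah,a,1\r\nbloo,a\r,b\r\nblee,c,d\r\n"), src)
strip_cr(src, dst)
cleaned <- read.csv(dst, stringsAsFactors = FALSE)
cleaned
```

The demonstration reads the result with read.csv() only to stay self-contained; the cleaned file can be passed to fread() in the same way as sampleNEW.csv above.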
We can try with gsub
fread(gsub("\r\n|\r", "", csv))
# A B C
#1: blah a 1
#2: bloo a b
#3: blee c d
You can also do this with tidyverse packages, if you'd like.
> library(readr)
> library(stringr)
> read_csv(str_replace_all(csv, "\r", ""))
# A tibble: 3 × 3
A B C
<chr> <chr> <chr>
1 blah a 1
2 bloo a b
3 blee c d
If you do want to do it purely in R, you could try working with connections. As long as a connection is kept open, it will start reading/writing from its previous position. Of course, this means the burden of opening and closing connections falls on you.
In the following code, the file is processed by chunks:
library(data.table)
input_csv <- "sample.csv"
in_conn <- file(input_csv)
output_csv <- "out.csv"
out_conn <- file(output_csv, "w+")
open(in_conn)
chunk_size <- 1E6
return_pattern <- "(?<=^|,|\n)([^,]*(?<!\n)\r(?!\n)[^,]*)(?=,|\n|$)"
buffer <- ""
repeat {
new_chars <- readChar(in_conn, chunk_size)
buffer <- paste0(buffer, new_chars)
while (grepl("[\r\n]$", buffer, perl = TRUE)) {
next_char <- readChar(in_conn, 1)
buffer <- paste0(buffer, next_char)
if (!length(next_char))
break
}
chunk <- gsub("(.*)[,\n][^,\n]*$", "\\1", buffer, perl = TRUE)
buffer <- substr(buffer, nchar(chunk) + 1, nchar(buffer))
cleaned <- gsub(return_pattern, '"\\1"', chunk, perl = TRUE)
writeChar(cleaned, out_conn, eos = NULL)
if (!length(new_chars))
break
}
writeChar('\n', out_conn, eos = NULL)
close(in_conn)
close(out_conn)
result <- fread(output_csv)
Process:
If a chunk ends with a \r or \n, another character is added until it doesn't.
Quotes are put around values containing a \r which isn't adjacent to a \n.
The cleaned chunk is added to the end of another file.
Rinse and repeat.
This code simplifies the problem by assuming no quoting is done for any field in sample.csv. It's not especially fast, but not terribly slow. Larger values for chunk_size should reduce the amount of time spent in I/O operations. If used for anything beyond this toy example, I'd strongly suggest wrapping it in a tryCatch(...) call to make sure the files are closed afterwards.
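The tryCatch(...) suggestion could look roughly like the sketch below. The wrapper name `with_connections` and its `process` argument are hypothetical; the point is only that the `finally` clause runs whether or not the processing step fails:

```r
# Hypothetical wrapper: open both connections, run `process` on them,
# and close them again even if `process` raises an error.
with_connections <- function(input_path, output_path, process) {
  in_conn <- file(input_path, "rb")
  out_conn <- file(output_path, "wb")
  tryCatch(
    process(in_conn, out_conn),
    finally = {
      close(in_conn)
      close(out_conn)
    }
  )
}

# Demonstration: a failing processing step still leaves no open connections.
src <- tempfile()
dst <- tempfile()
writeLines("A,B,C", src)
n_before <- nrow(showConnections())
result <- try(
  with_connections(src, dst, function(i, o) stop("simulated failure")),
  silent = TRUE
)
n_after <- nrow(showConnections())
```

In the chunked code above, the body of the repeat loop would go into the function passed as `process`.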
