Divide the text into two parts in R

I have around 20 text files. Each text file contains entries like this:
Suhas - Politics
Pope Francis has highlighted the plight of refugees from Syria and Iraq and condemned extremism at the start of a key visit to Turkey.
Sachin - Sports
Defending champion PV Sindhu continued her good run and entered the semifinals of the women's singles competition after beating China's Han Li in three games at the Macau Open Grand Prix Gold on Friday
Suhas - Politics
The United States lodged an appeal on Friday to challenge a World Trade Organization ruling that said it had failed to bring its meat labelling laws into line with global trade rules.
Sachin - Sports
After four games without a goal, Mumbai City FC would look to end their goal drought and get back to winning ways when they take on Delhi Dynamos at the Jawaharlal Nehru Stadium on Friday.
This pattern keeps repeating.
Question:
We need to copy all the Suhas data into one txt file and all the Sachin data into another txt file, i.e. separate the two kinds of data into two txt files.
I have shown the contents of one txt file above, but this needs to be done for all 20 txt files, i.e. 20 output files for Suhas and 20 for Sachin.
I need your help to build the R code.

Here, I created two files that start with Sports, i.e. Sports1.txt and Sports2.txt:
files <- list.files(pattern='^Sports\\d')
files
#[1] "Sports1.txt" "Sports2.txt"

lst <- lapply(files, function(x) {
  x1 <- readLines(x)
  x2 <- x1[x1 != '']
  indSuh <- grep("^Suhas", x2)
  indSach <- grep("^Sach", x2)
  # the paragraph of text follows each "Name - Topic" header line
  list(x2[indSuh + 1], x2[indSach + 1])
})

# names used to build the output file names (Suhas1.txt, Sachin1.txt, ...)
nm1 <- c('Suhas', 'Sachin')

Map(function(i, x, y) {
  nm2 <- paste(y, i, '.txt', sep='')
  lapply(seq_along(x), function(j) write.table(x[[j]], file=nm2[j]))
}, seq_along(lst), lst, list(nm1))

Here's one approach using two packages I maintain, qdap and qdapTools. I just added a function to qdapTools, loc_split, that will work nicely for this, but you'll need the development version.
First, install and load the packages:
library(devtools)
install_github("trinker/qdapTools")
library(qdap); library(qdapTools)
Now the code:
## path of folder with txt files
fileloc <- "mydata"

## Read in files
fls <- dir(fileloc)
input <- file.path(fileloc, fls[tools::file_ext(fls) == "txt"])
m <- unlist(lapply(input, readLines))

## Determine locations of the "Name - Topic" header lines
locs <- grep("^([a-zA-Z]+)\\s*-\\s*([a-zA-Z]+)$", m)

## split the text on the locations of group name with hyphen
out1 <- loc_split(m, locs)

## extract the meta data
meta <- sapply(out1, "[", 1)

## create a data.frame of text and meta data
dat <- data.frame(
    setNames(colSplit(meta, "-"), c("group", "topic")),
    text = sapply(out1, function(x) unbag(x[-1])),
    stringsAsFactors = FALSE
)

## split on the group variable (could do for topic or topic & group)
out2 <- split(dat[["text"]], dat[["group"]])

## Write out the lines using cat and the Map function
Map(function(x, y) {
    cat(paste(x, collapse = "\n\n"), file = sprintf("%s.txt", y))
}, out2, names(out2))
Note that this first makes a data frame with meta data about each text that looks like:
group topic text
1 Suhas Politics Pope Francis has highlighted the plight of re...
2 Sachin Sports Defending champion PV Sindhu continued her go...
3 Suhas Politics The United States lodged an appeal on Friday ...
This intermediate data frame can be useful in its own right.
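The comment in the code above notes that you could also split on the topic, or on topic and group together. A minimal sketch of the latter, building on the dat data frame created above (file names such as Suhas.Politics.txt are simply what interaction() produces here):
## split on both group and topic, one output file per combination
out3 <- split(dat[["text"]], interaction(dat[["group"]], dat[["topic"]], drop = TRUE))
Map(function(x, y) {
    cat(paste(x, collapse = "\n\n"), file = sprintf("%s.txt", y))
}, out3, names(out3))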

Related

String match error "invalid regular expression, reason 'Out of memory'"

I have a table that is shaped like this called df (the actual table is 16,263 rows):
title date brand
big farm house 2022-01-01 A
ranch modern 2022-01-01 A
town house 2022-01-01 C
Then I have a table like this called match_list (the actual list is 94,000 rows):
words_for_match
farm
town
clown
beach
city
pink
And I'm trying to filter the first table to just be rows where the title contains a word in the words_for_match list. So I do this:
match_list <- match_list$words_for_match
match_list <- paste(match_list, collapse = "|")
match_list <- sprintf("\\b(%s)\\b", match_list)
df %>%
  filter(grepl(match_list, title))
But then I get the following error:
Problem while computing `..1 = grepl(match_list, subject)`.
Caused by error in `grepl()`:
! invalid regular expression, reason 'Out of memory'
If I filter the table with 94,000 rows to just 1,000 then it runs, so it appears to just be a memory issue. So I'm wondering if there's a less memory-intensive way to do this or if this is an example of needing to look beyond my computer for computation. Advice on either pathway (or other options) is welcome. Thanks!
You could filter the titles sequentially: if, say, 10 titles already match 'farm', you do not need to test those titles against any other words.
Here is a simple implementation:
titles <- c("big farm house", "ranch modern", "town house")
words_for_match <- c("farm", "town", "clown", "beach", "city", "pink")

titles.to.keep <- c()
for (w in words_for_match) {
  w <- sprintf("\\b(%s)\\b", w)
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
  print(paste(length(titles), "remaining titles"))
}
titles.to.keep
If you have a prior on the frequency of the words in match_list, it is better to start with the most frequent ones.
UPDATE
You can also combine this with your previous strategy to make it faster:
gr.size <- 20
gr.words <- split(words_for_match, ceiling(seq_along(words_for_match) / gr.size))
gr.words <- sapply(gr.words, function(words) {
  words <- paste(words, collapse = "|")
  sprintf("\\b(%s)\\b", words)
})
and then iterate over gr.words instead of words_for_match in the first code chunk, as sketched below.
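For completeness, a minimal sketch of that modified loop (the patterns in gr.words are already wrapped with word boundaries, so they are used as-is):
titles.to.keep <- c()
for (w in gr.words) {
  is.match <- grepl(w, titles)
  titles.to.keep <- c(titles.to.keep, titles[is.match])
  titles <- titles[!is.match]
  print(paste(length(titles), "remaining titles"))
}
titles.to.keep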

Cleaning Data from PDF file

I am trying to scrape data from a PDF downloaded from the link below and store it as a data table for analysis.
https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf.
Here's what I have so far:
require(pdftools)
require(data.table)
require(stringr)
url <- "https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf"
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl)-1)]
dfl <- str_split(dfl, pattern = "(\n)")
This code nearly works; however, where the text in the Notes column wraps onto a new line (because of a \n), the entry spills over into an extra row. For example, for 19-Jan-84 the Notes column should read:
Corporate Event - Acquisition of Eagle Star by BAT Industries
But with my code, the "BAT Industries" spills over onto a new line whereas I would like it to be in the same string as the line above.
Once the code has run, I would like to have the same table as the PDF, with all the text going into the correct columns.
Thanks.
We may use the following manipulations.
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl) - 1)]
# Getting rid of the last line in every page
dfl <- gsub("\nFTSE Russell \\| FTSE 100 – Historic Additions and Deletions, November 2018[ ]+?\\d{1,2} of 12\n", "", dfl)
# Splitting not just by \n, but by \n that goes right before a date (positive lookahead)
dfl <- str_split(dfl, pattern = "(\n)(?=\\d{2}-\\w{3}-\\d{2})")
# For each page...
dfl <- lapply(dfl, function(df) {
  # Split vectors into 4 columns (sometimes we may have 5 due to the issue that
  # you mentioned, so str_split_fixed becomes useful) by possibly \n and
  # at least two spaces.
  df <- str_split_fixed(df, "(\n)*[ ]{2,}", 4)
  # Replace any remaining (in the last columns) cases of possibly \n and
  # at least two spaces.
  df <- gsub("(\n)*[ ]{2,}", " ", df)
  colnames(df) <- c("Date", "Added", "Deleted", "Notes")
  df[df == ""] <- NA
  data.frame(df[-1, ])
})
head(dfl[[1]])
# Date Added Deleted Notes
# 1 19-Jan-84 Charterhouse J Rothschild Eagle Star Corporate Event - Acquisition of Eagle Star by BAT Industries
# 2 02-Apr-84 Lonrho Magnet & Southerns <NA>
# 3 02-Jul-84 Reuters Edinburgh Investment Trust <NA>
# 4 02-Jul-84 Woolworths Barratt Development <NA>
# 5 19-Jul-84 Enterprise Oil Bowater Corporation Corporate Event - Sub division of company into Bowater Inds and Bowater Inc
# 6 01-Oct-84 Willis Faber Wimpey (George) & Co <NA>
I guess ultimately you are going to want a single data frame rather than a list of them. For that you may use do.call(rbind, dfl).
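For example, a minimal sketch (the date format is inferred from the sample output above; converting the Date column is optional):
final <- do.call(rbind, dfl)
# parse dates such as "19-Jan-84" (month abbreviations are locale dependent)
final$Date <- as.Date(final$Date, format = "%d-%b-%y")
head(final)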

Parsing XML for Ancient Greek Plays with speaker and dialogue

I am currently trying to read Greek plays which are available online as XML files into a data frame with a dialogue and speaker column.
I run the following commands to download the XML and parse the dialogue and speakers.
library(XML)
library(RCurl)
url <- "http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.01.0186"
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
speakersc <- xpathSApply(doc, "//speaker", xmlValue)
dialogue <- data.frame(text = plain.text, stringsAsFactors = FALSE)
speakers <- data.frame(text = speakersc, stringsAsFactors = FALSE)
However, I then encounter a problem. The dialogue will yield 300 rows (for 300 distinct lines in the play), but the speaker will yield 297.
The problem is due to the structure of the XML, reproduced below, where the <speaker> tag is not repeated for continued dialogue interrupted by a stage direction. Because I have to separate the dialogue with the <p> tag, this yields two dialogue rows but only one speaker row, without duplicating the speaker accordingly.
<speaker>Creon</speaker>
<stage>To the Guard.</stage>
<p>
You can take yourself wherever you please,
<milestone n="445" unit="line" ed="p"/>
free and clear of a heavy charge.
<stage>Exit Guard.</stage>
</p>
</sp>
<sp>
<stage>To Antigone.</stage>
<p>You, however, tell me—not at length, but briefly—did you know that an edict had forbidden this?</p>
</sp>
How can I parse the XML so the data will correctly yield the same number of dialogue rows for the same number of corresponding speaker rows?
For the above example, I would like the resulting data frame to either contain two rows for Creon's dialogue, corresponding to the lines of dialogue before and after the stage direction, or one row that treats Creon's dialogue as a single line, ignoring the separation caused by the stage direction.
Thank you very much for your help.
Consider using XPath's forward-looking following-sibling axis to search for the next <p> tag when speaker is empty, all while iterating through <sp>, which is the parent of <speaker> and <p>:
# ALL SP NODES
sp <- xpathSApply(doc, "//body/descendant::sp", xmlValue)

# ITERATE THROUGH EACH SP BY NODE INDEX TO CREATE LIST OF DFs
dfList <- lapply(seq_along(sp), function(i){
  data.frame(
    speakers = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker,'')"), xmlValue),
    dialogue = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker/following-sibling::p[1], ' ',
                                        //body/descendant::sp[position()=",i+1," and not(speaker)]/p[1])"), xmlValue)
  )
})

# ROW BIND LIST OF DFs AND SUBSET EMPTY SPEAKER/DIALOGUE
finaldf <- subset(do.call(rbind, dfList), speakers != "" & dialogue != "")
# SPECIFIC ROWS IN OP'S HIGHLIGHT
finaldf[85,]
# speakers
# 85 Creon
#
# dialogue
# 85 You can take yourself wherever you please,free and clear of a heavy
# charge.Exit Guard. You, however, tell me—not at length, but
# briefly—did you know that an edict had forbidden this?
finaldf[86,]
# speakers dialogue
# 87 Antigone I knew it. How could I not? It was public.
Another option is to read the URL and make some updates before parsing the XML: in this case, replace milestone tags with a space to avoid mashing words together, remove stage tags, and then fix the sp node without a speaker.
x <- readLines(url)
x <- gsub("<milestone[^>]*>", " ", x) # add space
x <- gsub("<stage>[^>]*stage>", "", x) # no space
x <- paste(x, collapse = "")
x <- gsub("</p></sp><sp><p>", "", x) # fix sp without speaker
Now the XML has the same number of sp and speaker tags.
doc <- xmlParse(x)
summary(doc)
p sp speaker div2 placeName
299 297 297 51 25 ...
Finally, get the sp nodes and parse speaker and paragraph.
sp <- getNodeSet(doc, "//sp")
s1 <- sapply( sp, xpathSApply, ".//speaker", xmlValue)
# collapse the 1 node with 2 <p>
p1 <- lapply( sp, xpathSApply, ".//p", xmlValue)
p1 <- trimws(sapply(p1, paste, collapse= " "))
speakers <- data.frame(speaker=s1, dialogue = p1)
speaker dialogue
1 Antigone Ismene, my sister, true child of my own mother, do you know any evil o...
2 Ismene To me no word of our friends, Antigone, either bringing joy or bringin...
3 Antigone I knew it well, so I was trying to bring you outside the courtyard gat...
4 Ismene Hear what? It is clear that you are brooding on some dark news.
5 Antigone Why not? Has not Creon destined our brothers, the one to honored buri...
6 Ismene Poor sister, if things have come to this, what would I profit by loose...
7 Antigone Consider whether you will share the toil and the task.
8 Ismene What are you hazarding? What do you intend?
9 Antigone Will you join your hand to mine in order to lift his corpse?
10 Ismene You plan to bury him—when it is forbidden to the city?
...

Most efficient way to read key value pairs where values span multiple lines?

What is the fastest way to parse a text file such as the example below into a two-column data.frame, which can then be transformed into a wide format?
FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
Chiesa, Luca Maria
Brizzolari, Andrea
Santaniello, Enzo
Passero, Elena
Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015
Using readLines is problematic because the multi-line fields don't have the keys. Reading as fixed width table also doesn't work. Suggestions? If not for the multiline issue, this would be easily accomplished with a function that operates on each row/record like so:
x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
key value
1 FN Thomson Reuters Web of Science
Notes: The fields will always be uppercase and two characters. The entire title and list of authors can be concatenated into a single cell.
This should work:
library(zoo)
x <- read.fwf(file = "tempSO.txt", widths = c(2, 500), as.is = TRUE)
# continuation lines have a blank key field; mark them NA ...
x$V1[trimws(x$V1) == ""] <- NA
# ... and carry the last seen key forward over them
x$V1 <- na.locf(x$V1)
res <- aggregate(V2 ~ V1, data = x, FUN = paste, collapse = "")
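The question also asks for a wide format; a minimal sketch that pivots the two-column result above into a one-row data frame (this assumes the file holds a single record, as in the example):
wide <- setNames(as.data.frame(t(res$V2), stringsAsFactors = FALSE), res$V1)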
Here's another idea that might be useful if you want to stay in base R:
parseEntry <- function(entry) {
  ## Split at the beginning of each line that starts with a non-space character
  ll <- strsplit(entry, "\\n(?=\\S)", perl = TRUE)[[1]]
  ## Clean up empty characters at the beginning of continuation lines
  ll <- gsub("\\n(\\s){3}", "", ll)
  ## Split each field into its two components
  read.fwf(textConnection(ll), c(2, max(nchar(ll))))
}
## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)
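If the file contained more than one entry, one possible preprocessing step (this sketch assumes records are separated by blank lines, which may need adjusting for real Web of Science exports) would be:
lines <- readLines("egFile.txt")
grp <- cumsum(!nzchar(lines))                 # new group at every blank line
entries <- split(lines[nzchar(lines)], grp[nzchar(lines)])
entries <- lapply(entries, paste, collapse = "\n")
parsed <- lapply(entries, parseEntry)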
Read the lines of the file into a character vector using readLines and append a colon to each key. The result is then in DCF format, so we can read it using read.dcf, the function used to read R package DESCRIPTION files. The result of read.dcf is wide, a matrix with one column per key. Finally we create long, a long data.frame with one row per key:
L <- readLines("myfile.dat")
L <- sub("^(\\S\\S)", "\\1:", L)
wide <- read.dcf(textConnection(L))
long <- data.frame(key = colnames(wide), value = wide[1,], stringsAsFactors = FALSE)
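read.dcf returns one row per record, so if the file held several entries, wide would have several rows; a sketch of the corresponding long form with a record id (the record column name is illustrative):
long_all <- do.call(rbind, lapply(seq_len(nrow(wide)), function(i)
  data.frame(record = i, key = colnames(wide), value = wide[i, ],
             stringsAsFactors = FALSE)))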

R html scrape with redirect links, word searches, and counts

I am trying to streamline a tedious process of online data collection with R scraping code. The website I am currently interested in is the Wisconsin Bills - Author Index.
The website features a redirect link to each legislator, and under each legislator there is a list of bills introduced, with a link to the major action summaries for each bill. My end goal is to create a data frame that includes a column for legislator name, the number of assembly bills (only links that include "AB") introduced, the number of bills that passed the assembly, and the number of bills signed into law.
Scraping the website, I have successfully created a data frame with each legislator's first name, last name, district, state (always WI) and year (always 1999, t-1 is when the session ended). Below is my code:
library(RCurl)
library(XML)

# specify the URL
url <- "https://docs.legis.wisconsin.gov/1997/related/author_index/assembly"
# download the HTML code
html <- getURL(url, ssl.verifypeer = FALSE, followlocation = TRUE)
# parse the HTML code
html.parsed <- htmlTreeParse(html, useInternalNodes = T)
# Get list of legislator names:
names <- xpathSApply(html.parsed, path = "//a[contains(@href, 'authorindex')]", xmlValue)
# get all links into a list:
links <- xpathSApply(html.parsed, "//a/@href")
# see what I have:
head(links) # still have hrefs in there
links <- as.vector(links)
head(links) # good, hrefs are dropped.
# I only need the links that begin with /document/authorindex/1997.
typeof(links) # confirming its character
links # looking to see which ones to keep (only ones with "authorindex" and "A__", where the number that follows A is the district)
links <- links[14:114] # now the links only have the legislator redirects!!!
# Lets begin to build the final data frame needed:
# first, take a look at names- there are 104, but there are only 100 legislators...
names # elements 3-103 are leg names
names <- names[3:103]
# split up by first name, last name, etc.
names <- as.vector(names)
names1 <- strsplit(names, ",")
last.names <- sapply(names1, "[[", 1) # good- create a data frame
id = c(1:101)
df <- data.frame(ID= id)
df$last.name = last.names # now have an ID and their last name.
# now need district, party, and first names.
first_names <- strsplit(names, "p.")
first_names # now republicans have 3 elements, dems have 2, first word of 2nd element is first name
# do another strsplit
first_names <- as.character(first_names)
first_names <- strsplit(first_names, " ()")
first_names # 4th element is almost always their name! do it that way, correct those that messed up by hand
first_names <- sapply(first_names, "[[", 4)
first_names # 10 (Timothy), 90 (William) 80 (Joan H) 80 (Tom) 47 (John)
# 25 (Jose) 17 (Stephen) 5 (Spencer)
first_names[5] <- "Spencer"
first_names[10] <- "Timothy"
first_names[90] <- "William"
first_names[80] <- "Joan H."
first_names[81] <- "Tom"
first_names[47] <- "John"
first_names[25] <- "Jose"
first_names[17] <- "Stephen"
df$first.name <- first_names # first names- done.
# district:
district <- regmatches(names, gregexpr("[[:digit:]]+", names))
df$district <- district
df$state <- "WI"
df$year <- 1999
Now, I'm stumped. I need to follow each redirect link, and count the number of AB links under that legislator's name ONLY, follow the AB links, and count the # of AB sites for each legislator that have the word "passed" in them and the # of AB sites that have the word "Sen." in them. I would thus like to add to the existing df the following columns:
Bills Introduced   Bills Passed Assembly   Bills Signed into Law
               4                       3                       2
              39                      18                      14
Etc. I get the sense I need to use loops, but I don't know how to approach it.
Any help would be incredibly appreciated.
Thank you!
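One possible looping approach, as a rough sketch only (untested against the live site; the XPath expressions, the relative-link handling, and the "passed" / "Sen." text searches are all assumptions based on the description above):
base <- "https://docs.legis.wisconsin.gov"
counts <- lapply(links, function(lnk) {
  # fetch and parse the legislator's author-index page
  leg.html <- getURL(paste0(base, lnk), ssl.verifypeer = FALSE, followlocation = TRUE)
  leg.doc  <- htmlTreeParse(leg.html, useInternalNodes = TRUE)
  # keep only the links whose text contains "AB"
  ab.links <- as.vector(xpathSApply(leg.doc, "//a[contains(., 'AB')]/@href"))
  passed <- signed <- 0
  for (ab in ab.links) {
    # fetch each bill's action summary page and search its text
    bill.html <- getURL(paste0(base, ab), ssl.verifypeer = FALSE, followlocation = TRUE)
    bill.txt  <- xpathSApply(htmlTreeParse(bill.html, useInternalNodes = TRUE),
                             "//body", xmlValue)
    if (any(grepl("passed", bill.txt, ignore.case = TRUE))) passed <- passed + 1
    if (any(grepl("Sen\\.", bill.txt)))                     signed <- signed + 1
  }
  c(Bills.Introduced = length(ab.links),
    Bills.Passed.Assembly = passed,
    Bills.Signed.into.Law = signed)
})
df <- cbind(df, do.call(rbind, counts))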
