I have documents such as:
President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
When I read those .txt documents, I would like to create a second column indicating the speaker name.
So what I tried first was to create a list of all possible names and replace them:
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("#Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","#President Dr. Norbert Lammert:")
prok <- scan(".txt", what = "character", sep = "\n")
prok <- mgsub(members,members_r,prok)
prok <- as.data.frame(prok)
prok$speaker <- grepl("#[^\\#:]*:",prok$prok, ignore.case = T)
My plan was then to extract the name between # and : via regex wherever speaker == TRUE, fill it downwards until a different name appears (and, obviously, remove all the applause/shout brackets), but that is the part where I am not sure how to proceed.
Here is the approach:
require(qdap)
#text is the document text
# remove round brackets and text b/w ()
a <- bracketX(text, "round")
names <- c("President Dr. Norbert Lammert","Alexander Dobrindt" )
searchString <- paste(names[1],names[2], sep = ".+")
# Get string from names[1] till names[2] with the help of searchString
string <- regmatches(a, regexpr(searchString, a))
# remove names[2] from string
string <- gsub(names[2],"",string)
This code can be looped when there are more than two names.
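A minimal sketch of that loop (assuming, as above, that a holds the bracket-stripped text and that names lists the speakers in speaking order):
# Extract each speaker's speech: from their name up to the next name
speeches <- character(0)
for (i in seq_along(names)) {
  searchString <- if (i < length(names)) {
    paste(names[i], names[i + 1], sep = ".+")  # this name ... next name
  } else {
    paste0(names[i], ".+")                     # last speaker: match to end of text
  }
  string <- regmatches(a, regexpr(searchString, a))
  if (i < length(names)) string <- gsub(names[i + 1], "", string)
  speeches <- c(speeches, string)
}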
Here is an approach leaning heavily on dplyr.
First, I added a sentence to your sample text to illustrate why we can't just use a colon to identify speaker names.
sampleText <-
"President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
This sentence right here: it is an example of a problem"
I then split the text to simulate the format you appear to be reading it in (which also puts each speech into its own element of a list).
splitText <- strsplit(sampleText, "\n")
Then, I pull out all of the potential speakers (anything that precedes a colon):
library(dplyr) # for the pipe (%>%) and the verbs used below
allSpeakers <- lapply(splitText, function(thisText){
grep(":", thisText, value = TRUE) %>%
gsub(":.*", "", .) %>%
gsub("\\(", "", .)
}) %>%
unlist() %>%
unique()
Which gives us:
[1] "President Dr. Norbert Lammert"
[2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
[3] "Volker Kauder [CDU/CSU]"
[4] "This sentence right here"
Obviously, the last one is not a legitimate name, so should be excluded from our list of speakers:
legitSpeakers <- allSpeakers[-4]
Now, we are ready to work through the speech. I have included stepwise comments below, instead of describing them in text here.
speechText <- lapply(splitText, function(thisText){
# Remove applause and interjections (things in parentheses)
# along with any blank lines; though you could leave blanks if you want
cleanText <-
grep("(^\\(.*\\)$)|(^$)", thisText
, value = TRUE, invert = TRUE)
# Split each line at the colon
strsplit(cleanText, ":") %>%
lapply(function(x){
# Check if the first element is a legit speaker
if(x[1] %in% legitSpeakers){
# If so, set the speaker, and put the statement in a separate portion
# taking care to re-collapse any breaks caused by additional colons
out <- data.frame(speaker = x[1]
, text = paste(x[-1], collapse = ":"))
} else{
# If not a legit speaker, set speaker to NA and reset text as above
out <- data.frame(speaker = NA
, text = paste(x, collapse = ":"))
}
# Return whichever version we made above
return(out)
}) %>%
# Bind all of the rows together
bind_rows %>%
# Identify clusters of speech that go with a single speaker
mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
# Group by those clusters
group_by(speakingGroup) %>%
# Collapse that speaking down into a single row
summarise(speaker = speaker[1]
, fullText = paste(text, collapse = "\n"))
})
This yields
[[1]]
speakingGroup speaker fullText
1 President Dr. Norbert Lammert I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.
2 Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem
If you prefer to have each line of text separately, replace the summarise at the end with mutate(speaker = speaker[1]) and you will get one line for each line of the speech, like this:
speaker text speakingGroup
President Dr. Norbert Lammert I declare the session open. 1
President Dr. Norbert Lammert I will now give the floor to Bundesminister Alexander Dobrindt. 1
Alexander Dobrindt, Minister for Transport and Digital Infrastructure 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective. 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure This sentence right here: it is an example of a problem 2
This seems to work
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("#Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","#President Dr. Norbert Lammert:")
testprok <- read.table("txt",header=FALSE,quote = "\"",comment.char="",sep="\t")
testprok$V1 <- mgsub(members,members_r,testprok$V1)
testprok$V2 <- ifelse(grepl("#[^\\#:]*:",testprok$V1),testprok$V1,NA)
####function from http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value
repeat.before = function(x) { # repeats the last non NA value. Keeps leading NA
ind = which(!is.na(x)) # get positions of nonmissing values
if(is.na(x[1])) # if it begins with a missing, add the
ind = c(1,ind) # first position to the indices
rep(x[ind], times = diff( # repeat the values at these indices
c(ind, length(x) + 1) )) # diffing the indices + length yields how often
} # they need to be repeated
testprok$V2 = repeat.before(testprok$V2)
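If you then want only the name (the part between # and :, as planned in the question) rather than the whole marker line, a small sub() on V2 should do it; a sketch, reusing the column from above:
# Keep just the speaker's name: drop the leading "#" and everything
# from the first ":" onwards (NA rows stay NA)
testprok$V2 <- sub("^#([^:]*):.*", "\\1", testprok$V2)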
Consider the input character vectors below:
text_input <- c("ADOPT", "A", "FAIL", "FAST")
test <- c("TEST", "INPUT", "FAIL", "FAST")
I would like to match the two inputs and extract the words that text_input has in common with test; I would like to do something similar to str_extract.
I understand that str_extract uses a matching pattern or word, but my test data consists of around 500,000 words. Any inputs would be really helpful.
Expected Outcome:
"FAIL", "FAST"
EDIT
Just adding one more question here: what happens when the input is a pure string, like the one provided below?
text_input <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")
test <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")
Is it possible to perform string matching even in this case, as mentioned above?
Note: Changed the input and testing string.
If we need the characters to be extracted (this builds character classes from the test strings, so it matches runs of those characters rather than whole words):
library(stringr)
str_extract(text_input, paste0("[", test, "]+"))
If we are looking for a full string match:
library(data.table)
fintersect(data.table(col1 = text_input), data.table(col1 = test))
For the easy example you may use intersect(), as was already stated in the comments.
text_input1 <- c("ADOPT", "A", "FAIL", "FAST")
test1 <- c("TEST", "INPUT", "FAIL", "FAST")
intersect(text_input1, test1)
# [1] "FAIL" "FAST"
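If you need to keep duplicates or the original order (intersect() returns unique values), the same matches can be had with %in%:
text_input1[text_input1 %in% test1]
# [1] "FAIL" "FAST"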
The long example is a little more complicated.
text_input2 <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")
phrases <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")
The test string vector you've defined (I'll call it phrases) contains compound terms of two or more words, i.e. terms containing spaces. Therefore we need a regular expression rx1 that can handle them. It is not clear whether you want case-sensitive matches or not; for the latter you'd need to tolower() both the phrases and the text. Next we test whether there's a match. If so, we extend the regex to rx2 so that we can use it with gsub()'s replacement functionality. Finally, we Vectorize() our function so that it can handle vectors of phrases.
matchPhrase <- Vectorize(function(phr, txt, tol=FALSE) {
rx1 <- gsub(" ", "\\\\s", phr) # handle spaces
if (tol) { # optional tolower
rx1 <- tolower(rx1)
txt <- tolower(txt)
}
if (regexpr(rx1, txt) > 0) { # test for matches
rx2 <- paste0(".*(", rx1, ").*")
return(gsub(rx2, "\\1", txt)) # gsub extraction
} else {
return(NA) # we want NA for no matches
}
})
The default (tol=FALSE) is case-sensitive matching.
matchPhrase(phrases, text_input2, tol=FALSE)
# Data Scientist McKinsey ORGANIZATIONS FAST
# "Data Scientist" "McKinsey" NA NA
Case-insensitive matching (tol=TRUE) also finds "organizations":
matchPhrase(phrases, text_input2, tol=TRUE)
# Data Scientist McKinsey ORGANIZATIONS FAST
# "data scientist" "mckinsey" "organizations" NA
For a clean output just do:
as.character(na.omit(matchPhrase(phrases, text_input2, tol=TRUE)))
# [1] "data scientist" "mckinsey" "organizations"
Note: you will probably need to adapt the function for your specific needs/desired outputs. The quanteda package is actually quite sophisticated at this kind of task.
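For instance, a minimal quanteda sketch of the same lookup (my addition, not part of the answer above; phrase() treats space-separated terms as multi-word patterns, and matching is case-insensitive by default):
library(quanteda)
toks <- tokens(text_input2)
kwic(toks, pattern = phrase(phrases))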
This can also be achieved using the package fuzzyjoin, which contains a way to join data frames based on regex.
text_input <- c("ADOPT", "A", "FAIL", "FAST")
regex <- c("TEST", "INPUT", "FAIL", "FAST")
library(fuzzyjoin)
library(dplyr)
df <- tibble( text = text_input )
df.regex <- tibble( regex_name = regex )
# now we can regex match them
df %>%
regex_left_join( df.regex, by = c( text = "regex_name" ) )
# # A tibble: 4 x 2
# text regex_name
# <chr> <chr>
# 1 ADOPT NA
# 2 A NA
# 3 FAIL FAIL
# 4 FAST FAST
#or only regex 'hits'
df %>%
regex_inner_join( df.regex, by = c( text = "regex_name" ) )
# # A tibble: 2 x 2
# text regex_name
# <chr> <chr>
# 1 FAIL FAIL
# 2 FAST FAST
I am trying to scrape data from a PDF downloaded from the link below and store it as a data.table for analysis.
https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf.
Here's what I have so far:
require(pdftools)
require(data.table)
require(stringr)
url <- "https://www.ftse.com/products/downloads/FTSE_100_Constituent_history.pdf"
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl)-1)]
dfl <- str_split(dfl, pattern = "(\n)")
This code nearly works; however, where text in the Notes column wraps onto a new line (a \n in the PDF text), my code splits it across entries. For example, on 19-Jan-84 the Notes column should read:
Corporate Event - Acquisition of Eagle Star by BAT Industries
But with my code, "BAT Industries" spills over onto a new line, whereas I would like it to be in the same string as the line above.
Once the code has run I would like to have the same table as the PDF, with all the text going into the correct columns.
Thanks.
We may use the following manipulations.
dfl <- pdf_text(url)
dfl <- dfl[2:(length(dfl) - 1)]
# Getting rid of the last line in every page
dfl <- gsub("\nFTSE Russell \\| FTSE 100 – Historic Additions and Deletions, November 2018[ ]+?\\d{1,2} of 12\n", "", dfl)
# Splitting not just by \n, but by \n that goes right before a date (positive lookahead)
dfl <- str_split(dfl, pattern = "(\n)(?=\\d{2}-\\w{3}-\\d{2})")
# For each page...
dfl <- lapply(dfl, function(df) {
# Split vectors into 4 columns (sometimes we may have 5 due to the issue that
# you mentioned, so str_split_fixed becomes useful) by possibly \n and
# at least two spaces.
df <- str_split_fixed(df, "(\n)*[ ]{2,}", 4)
# Replace any remaining (in the last columns) cases of possibly \n and
# at least two spaces.
df <- gsub("(\n)*[ ]{2,}", " ", df)
colnames(df) <- c("Date", "Added", "Deleted", "Notes")
df[df == ""] <- NA
data.frame(df[-1, ])
})
head(dfl[[1]])
# Date Added Deleted Notes
# 1 19-Jan-84 Charterhouse J Rothschild Eagle Star Corporate Event - Acquisition of Eagle Star by BAT Industries
# 2 02-Apr-84 Lonrho Magnet & Southerns <NA>
# 3 02-Jul-84 Reuters Edinburgh Investment Trust <NA>
# 4 02-Jul-84 Woolworths Barratt Development <NA>
# 5 19-Jul-84 Enterprise Oil Bowater Corporation Corporate Event - Sub division of company into Bowater Inds and Bowater Inc
# 6 01-Oct-84 Willis Faber Wimpey (George) & Co <NA>
I guess ultimately you are going to want a single data frame rather than a list of them. For that you may use do.call(rbind, dfl).
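For example (constituents is just a name I'm introducing here):
# Collapse the per-page list into one data frame; rbindlist() from
# data.table, which is already loaded, does the same job
constituents <- do.call(rbind, dfl)
head(constituents)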
I have a dataframe corpus in R; a sample of its structure is shown below.
I want to create n-grams (up to 5-grams) using loops or functions. Currently, I am doing it manually in this way:
Sample corpus structure:
{"colleagues were also at the other two events in aberystwyth and flint and by all accounts had a great time",
"the lineup was whittled down to a more palatable five in when the bing crosby souffle going my way bested both gaslight and double indemnity proving oscar voters have always had a taste for pabulum",
"felt my first earthquake today whole building at work was shaking",
"she is the kind of mother friend and woman i aspire everyday to be",
"she was processed and released pending a court appearance",
"watching some sunday night despite the sadness i have been feeling i also feel very blessed and happy to be carrying another miracle",
"every night when we listen to poohs heartbeat our hearts feel so much happiness and peace",}
onegram <- NGramTokenizer(corpusdf, Weka_control(min=1, max=1))
onegram <- data.frame(table(onegram))
onegram <- onegram[order(onegram$Freq, decreasing = TRUE),]
colnames(onegram) <- c("Word", "Freq")
onegram [1:15,]
bigram <- NGramTokenizer(corpusdf, Weka_control(min=2, max=2, delimiters = tokendelim))
bigram <- data.frame(table(bigram))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE),]
colnames(bigram) <- c("Word", "Freq")
bigram [1:15,]
Any ideas?
I don't know the function NGramTokenizer and couldn't get it to work, so here is a solution in quanteda, which produces an individual frequency table for each n (gram_1 for unigrams, gram_2 for bigrams, and so on):
corpusdf <- data.frame(text = c("colleagues were also at the other two events in aberystwyth and flint and by all accounts had a great time", "the lineup was whittled down to a more palatable five in when the bing crosby souffle going my way bested both gaslight and double indemnity proving oscar voters have always had a taste for pabulum", "felt my first earthquake today whole building at work was shaking", "she is the kind of mother friend and woman i aspire everyday to be", "she was processed and released pending a court appearance", "watching some sunday night despite the sadness i have been feeling i also feel very blessed and happy to be carrying another miracle", "every night when we listen to poohs heartbeat our hearts feel so much happiness and peace"),
stringsAsFactors = FALSE)
library("quanteda")
tokens <- tokens(corpusdf$text, what = "word")
for (n in seq_len(5)) {
temp <- tokens_ngrams(tokens, n = n, skip = 0L, concatenator = "_")
temp <- data.frame(table(unlist(temp)),
stringsAsFactors = FALSE)
colnames(temp) <- c("Word", "Freq")
temp <- temp[order(temp$Freq, decreasing = TRUE),]
assign(paste0("gram_", n), temp)
}
head(gram_2)
Output looks like this:
> head(gram_2)
Word Freq
53 had_a 2
101 to_be 2
1 a_court 1
2 a_great 1
3 a_more 1
4 a_taste 1
Update: after I realised that NGramTokenizer belongs to the RWeka package and not tm, @phiver's answer works for me:
ngrams <- RWeka::NGramTokenizer(corpusdf, Weka_control(min=1, max=5))
ngrams <- data.frame(table(ngrams),
stringsAsFactors = FALSE)
ngrams <- ngrams[order(ngrams$Freq, decreasing = TRUE),]
head(ngrams)
However, this mixes up all the n-grams, which does not make much sense if you want to rank frequencies (unigrams will naturally be on top). So here is a loop solution:
for (n in seq_len(5)) {
temp <- RWeka::NGramTokenizer(corpusdf, Weka_control(min=n, max=n))
temp <- data.frame(table(unlist(temp)),
stringsAsFactors = FALSE)
colnames(temp) <- c("Word", "Freq")
temp <- temp[order(temp$Freq, decreasing = TRUE),]
assign(paste0("gram_", n), temp)
}
head(gram_2)
The text I am using is below.
So far, I have imported the text:
tempest.v <- scan("data/plainText/tempest.txt", what="character", sep="\n")
Identified where all of the speaker positions begin:
speaker.positions.v <- grep('^[^\\s]\\w+:', tempest.v)
Added a marker at the end of the text:
tempest.v <- c(tempest.v, "END:")
Here's the part where I'm having difficulty (assuming what I've already done is useful):
for(i in 1:length(speaker.positions.v)){
if(i != length(speaker.positions.v)){
speaker.name <- tempest.v[speaker.positions.v[i]]
speaker.name <- strsplit(speaker.name, ":")
speaker.name <- unlist(speaker.name)
start <- speaker.positions.v[i]+1
end <- speaker.positions.v[i+1]-1
speaker.lines.v <- tempest.v[start:end]
}
}
Now I have a variable, speaker.name, that has, on the left-hand side of the split, the name of the character who is speaking. The right-hand side of the split is the dialogue only up through the first line break.
I set the start of the dialogue block at position [i]+1 and
the end at [i+1]-1 (i.e., one position back from the beginning of the subsequent speaker's name).
Now I have a variable, speaker.lines.v with all of the lines of dialogue for that speaker for that one speech.
How can I collect all of Prospero's then Miranda's (then any other character's) dialogue into a single (list? vector? data frame?) for analysis?
Any help with this would be greatly appreciated.
Happy New Year!
--- TEXT ---
Miranda: If by your art, my dearest father, you have
Put the wild waters in this roar, allay them.
The sky, it seems, would pour down stinking pitch,
But that the sea, mounting to the welkin's cheek,
Dashes the fire out. O, I have suffered
With those that I saw suffer -- a brave vessel,
Who had, no doubt, some noble creature in her,
Dash'd all to pieces. O, the cry did knock
Against my very heart. Poor souls, they perish'd.
Had I been any god of power, I would
Have sunk the sea within the earth or ere
It should the good ship so have swallow'd and
The fraughting souls within her.
Prospero: Be collected:
No more amazement: tell your piteous heart
There's no harm done.
Miranda: O, woe the day!
Prospero: No harm.
I have done nothing but in care of thee,
Of thee, my dear one, thee, my daughter, who
Art ignorant of what thou art, nought knowing
Of whence I am, nor that I am more better
Than Prospero, master of a full poor cell,
And thy no greater father.
Miranda: More to know
Did never meddle with my thoughts.
Prospero: 'Tis time
I should inform thee farther. Lend thy hand,
And pluck my magic garment from me. So:
[Lays down his mantle]
Lie there, my art. Wipe thou thine eyes; have comfort.
The direful spectacle of the wreck, which touch'd
The very virtue of compassion in thee,
I have with such provision in mine art
So safely ordered that there is no soul—
No, not so much perdition as an hair
Betid to any creature in the vessel
Which thou heard'st cry, which thou saw'st sink. Sit down;
For thou must now know farther.
--- END TEXT ---
I first saved the text you put here as test.txt. Then read it:
tempest <- scan("~/Desktop/test.txt", what = "character", sep = "\n")
Then pulled out only the spoken lines, as you did:
speakers <- tempest[grepl("^[^\\s]\\w+:", tempest)]
Then we split off the speaker's name:
speaker_split <- strsplit(speakers, split = ":")
And get the names:
speaker_names <- sapply(speaker_split, "[", 1L)
And what they said (collapsing because their lines may have had other colons that we lost):
speaker_parts <- sapply(speaker_split, function(x) paste(x[-1L], collapse = ":"))
From here we just need indices of who said what and we can do what we want:
prosp <- which(speaker_names == "Prospero")
miran <- which(speaker_names == "Miranda")
And play to your heart's content.
Who said the most words?
> sum(unlist(strsplit(speaker_parts[prosp], split = "")) == " ")
[1] 82
> sum(unlist(strsplit(speaker_parts[miran], split = "")) == " ")
[1] 67
Prospero.
What is the frequency of letters used by Miranda?
> table(tolower(unlist(strsplit(gsub("[^A-Za-z]", "", speaker_parts[miran]),
split = ""))))
a b c d e f g h i k l m n o p r s t u v w y
17 3 2 11 34 7 3 21 16 5 7 7 9 17 3 14 18 30 11 5 10 8
We're going to use the rebus package to create regular expressions, stringi to match those regular expressions, and data.table to store the data.
library(rebus)
library(stringi)
library(data.table)
First trim leading and trailing spaces from the lines
tempest.v <- stri_trim(tempest.v)
Get rid of empty lines
tempest.v <- tempest.v[nzchar(tempest.v)]
Remove stage directions
stage_dir_rx <- exactly(
OPEN_BRACKET %R%
one_or_more(printable()) %R%
"]"
)
is_stage_dir_line <- stri_detect_regex(tempest.v, stage_dir_rx)
tempest.v <- tempest.v[!is_stage_dir_line]
Match lines containing "character: dialogue".
character_dialogue_rx <- START %R%
optional(capture(one_or_more(alpha()) %R% lookahead(":"))) %R%
optional(":") %R%
zero_or_more(space()) %R%
capture(one_or_more(printable()))
matches <- stri_match_first_regex(tempest.v, character_dialogue_rx)
Store the matches in a data.table (we need this for the roll functionality). A line number key column is also needed in a moment.
tempest_data <- data.table(
line_number = seq_len(nrow(matches)),
character = matches[, 2],
dialogue = matches[, 3]
)
Fill in missing values, using the method described in this answer.
setkey(tempest_data, line_number)
tempest_data[, character := tempest_data[!is.na(character)][tempest_data, character, roll = TRUE]]
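As an aside, the same fill-down could be done with zoo::na.locf, a swapped-in alternative to the rolling join (assuming the zoo package is available):
# Last observation carried forward; na.rm = FALSE keeps any leading NAs
library(zoo)
tempest_data[, character := na.locf(character, na.rm = FALSE)]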
The data currently has line information preserved: each row contains one line of dialogue.
line_number character dialogue
1: 1 Miranda If by your art, my de....
2: 2 Miranda Who had, no doubt, so....
3: 3 Prospero Be collected: No more....
4: 4 Miranda O, woe the day!
5: 5 Prospero No harm. I have done ....
6: 6 Miranda More to know Did neve....
7: 7 Prospero 'Tis time I should in....
8: 8 Prospero Lie there, my art. Wi....
To get all the dialogue for a given character as a single string, summarise using the by argument.
tempest_data[, .(all_dialogue = paste(dialogue, collapse = "\n")), by = "character"]
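From there, per-character analysis is one step away, e.g. counting the words each character speaks (dialogue_by_character and word_count are names I'm introducing):
dialogue_by_character <- tempest_data[
  , .(all_dialogue = paste(dialogue, collapse = "\n")), by = "character"]
dialogue_by_character[, word_count := lengths(strsplit(all_dialogue, "\\s+"))]
dialogue_by_character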
I was interested in this question because I'm developing a series of tools for these types of tasks. Here is how to solve this problem using those tools.
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textshape", "trinker/qdapRegex")
pacman::p_load(dplyr)
pat <- '^[^\\s]\\w+:'
"tempest.txt" %>%
readLines() %>%
{.[!grepl("^(---)|(^\\s*$)", .)]} %>%
split_match(pat, regex=TRUE, include=TRUE) %>%
textshape::combine() %>%
{setNames(., sapply(., function(x) unlist(ex_default(x, pattern = pat))))} %>%
bind_list("person") %>%
mutate(content = gsub(pat, "", content)) %>%
`[` %>%
textshape::combine()
Result:
person content
1 Miranda: If by your art, my dearest father, you ...
2 Prospero: Be collected No more amazement tell you ..
To avoid combining (as @RichieCotton displays initially), leave off the last textshape::combine() in the chain.
I have around 20 text files.
Each text file looks like this:
Suhas - Politics
Pope Francis has highlighted the plight of refugees from Syria and Iraq and condemned extremism at the start of a key visit to Turkey.
Sachin - Sports
Defending champion PV Sindhu continued her good run and entered the semifinals of the women's singles competition after beating China's Han Li in three games at the Macau Open Grand Prix Gold on Friday
Suhas - Politics
The United States lodged an appeal on Friday to challenge a World Trade Organization ruling that said it had failed to bring its meat labelling laws into line with global trade rules.
Sachin - Sports
After four games without a goal, Mumbai City FC would look to end their goal drought and get back to winning ways when they take on Delhi Dynamos at the Jawaharlal Nehru Stadium on Friday.
This pattern keeps on going.
Question:
We need to copy all of the Suhas data into one txt file and all of the Sachin data into another, i.e. separate the two kinds of data into two txt files.
I have shown one txt file, but this needs to be done for all 20 files, i.e. 20 txt files for Suhas and 20 for Sachin.
I need your help to build the R code.
Here, I created two files whose names start with Sports, i.e. Sports1.txt and Sports2.txt:
files <- list.files(pattern='^Sports\\d')
files
#[1] "Sports1.txt" "Sports2.txt"
lst <- lapply(files, function(x) {x1 <- readLines(x)
               x2 <- x1[x1 != '']
               indSuh <- grep("^Suhas", x2)
               indSach <- grep("^Sach", x2)
               # the line following each header holds the text itself
               list(x2[indSuh + 1], x2[indSach + 1])})
nm1 <- c('Suhas', 'Sachin') # output-file prefixes, one per group
Map(function(i, x, y){nm2 <- paste(y, i, '.txt', sep='')
    lapply(seq_along(x), function(j) write.table(x[[j]],
           file=nm2[j]))}, seq_along(lst), lst, list(nm1))
Here's one approach using two packages I maintain, qdap and qdapTools. I just added a function to qdapTools, loc_split, that will work nicely for this, but you'll need the development version.
First getting the packages to get started:
library(devtools)
install_github("trinker/qdapTools")
library(qdap); library(qdapTools)
Now the code:
## path of folder with txt files
fileloc <- "mydata"
## Read in Files
fls <- dir(fileloc)
input <- file.path(fileloc , fls[tools::file_ext(fls) == "txt"])
m <- unlist(lapply(input, readLines))
## Determine location of blank lines
locs <- grep("^([a-zA-Z]+)\\s*-\\s*([a-zA-Z]+)$", m)
## split text on locations of group name with hyphen
out1 <- loc_split(m, locs)
## extract the meta data
meta <- sapply(out1, "[", 1)
## create a data.frame of text and meta data
dat <- data.frame(
setNames(colSplit(meta, "-"), c("group", "topic")),
text = sapply(out1, function(x) unbag(x[-1])),
stringsAsFactors = FALSE
)
## split on the group variable (could do for topic or topic & group)
out2 <- split(dat[["text"]], dat[["group"]])
## Write out the lines using cat and the Map function
Map(function(x, y) {
cat(paste(x, collapse="\n\n"), file=sprintf("%s.txt", y))
}, out2, names(out2))
Note that this first makes a data frame with metadata about each text, which looks like:
group topic text
1 Suhas Politics Pope Francis has highlighted the plight of re...
2 Sachin Sports Defending champion PV Sindhu continued her go...
3 Suhas Politics The United States lodged an appeal on Friday ...
This metadata data frame can be useful in its own right.