R Programming: need a string-based solution for splitting huge text

I have a text file with sample text like below, all in lower case:
"venezuela probes ex-oil czar ramirez over alleged graft scheme
caracas/houston (reuters) - venezuela is investigating rafael ramirez, a
once powerful oil minister and former head of state oil company pdvsa, in
connection with an alleged $4.8 billion vienna-based corruption scheme, the
state prosecutor's office announced on friday.
5.5 hours ago
— reuters
amazon ordered not to pull in customers who can't spell `birkenstock'
a german court has ordered amazon not to lure internet shoppers to its
online marketplace when they mistakenly search for "brikenstock",
"birkenstok", "bierkenstock" and other variations in google.
6 hours ago
— business standard"
What I need in R is to get these two pieces of text, separated out.
The first piece of text would correspond with the text1 variable and the second piece of text should correspond with the text2 variable.
Please remember I have many text-like paragraphs in this file. The solution would have to work for, say, 100,000 texts.
The only thing I thought that could be used as a delimiter is "—" but with that I lose the source of the information such as "reuters" or "business standard". I need that as well.
Would you know how to accomplish this in R?

Read the text from file with readLines and then split on the shifted cumsum of the occurrence of that special dash in front of the publisher:
Lines <- readLines("Lines.txt") # from file in wd()
split(Lines, cumsum(c(0, head(grepl("—", Lines),-1))) )
#--------------
$`0`
[1] "venezuela probes ex-oil czar ramirez over alleged graft scheme"
[2] "caracas/houston (reuters) - venezuela is investigating rafael ramirez, a "
[3] "once powerful oil minister and former head of state oil company pdvsa, in "
[4] "connection with an alleged $4.8 billion vienna-based corruption scheme, the "
[5] "state prosecutor's office announced on friday."
[6] "5.5 hours ago"
[7] "— reuters"
$`1`
[1] "amazon ordered not to pull in customers who can't spell `birkenstock'"
[2] "a german court has ordered amazon not to lure internet shoppers to its "
[3] "online marketplace when they mistakenly search for \"brikenstock\", "
[4] "\"birkenstok\", \"bierkenstock\" and other variations in google."
[5] "6 hours ago"
[6] "— business standard'"
It's not a regular "-"; it's a "—". And notice that by default readLines will omit the blank lines.
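If the goal is two clean pieces per article rather than a raw list, the split groups can be collapsed into News/Time/Publisher fields in one pass. A minimal sketch, assuming each block ends with a time line followed by the "— publisher" line:
groups <- split(Lines, cumsum(c(0, head(grepl("—", Lines), -1))))
rows <- lapply(groups, function(g) {
  n <- length(g)
  data.frame(News      = paste(g[seq_len(n - 2)], collapse = " "),
             Time      = g[n - 1],
             Publisher = sub("^—\\s*", "", g[n]),  # keep the source, drop the dash
             stringsAsFactors = FALSE)
})
do.call(rbind, rows)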

Here's what I could do. I do not like the loop in this, but I could not vectorize it. Hopefully this answer will at least serve as a starting point for other better answers.
Assumption: all publisher names are preceded by "— "
TEXT <- read.delim2("C:/Users/Arani.das/Desktop/TEXT.txt", header=FALSE, quote="", stringsAsFactors=F)
TEXT$Publisher <- grepl("— ", TEXT$V1)
TEXT$V1 <- gsub("^\\s+|\\s+$", "", TEXT$V1) #trim whitespaces in start and end of line
TEXT$FLAG <- 1 #grouping variable
for (i in 2:nrow(TEXT)) {
  if (TEXT$Publisher[i-1]) {
    TEXT$FLAG[i] <- TEXT$FLAG[i-1] + 1  # a publisher line ends a group, so start a new one
  } else {
    TEXT$FLAG[i] <- TEXT$FLAG[i-1]
  }
} # grouping entries
TEXT <- data.table::data.table(TEXT, key="FLAG")
TEXT2 <- TEXT[, list(News=paste0(V1[1:(length(V1)-2)], collapse=" "), Time=V1[length(V1)-1], Publisher=V1[length(V1)]), by="FLAG"]
Output:
   FLAG         News          Time          Publisher
1:    1 venezuela... 5.5 hours ago          — reuters
2:    2    amazon...   6 hours ago — business standard
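As an aside, the grouping loop above can likely be vectorized with the same shifted-cumsum idea as the previous answer (a sketch, given the Publisher flag already computed):
# a publisher line closes a group, so a new group starts on the following line
TEXT$FLAG <- cumsum(c(TRUE, head(TEXT$Publisher, -1)))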

Related

How to extract multiple quotes from multiple documents in R?

I have several Word files containing articles from which I want to extract the strings between quotes. My code works fine if I have one quote per article, but if I have more than one, R also extracts the sentence that separates one quote from the next.
Here is the text from my articles:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, “I adore tigers”. This is the end.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
And this is my code:
library(readtext)
library(stringr)
#' folder where you've saved your articles
path <- "articles"
#' reads in anything saved as .docx
mydata <-
readtext(paste0(path, "\\*.docx")) #' make sure the Word document is saved as .docx
#' remove curly punctuation
mydata$text <- gsub("/’", "/'", mydata$text, ignore.case = TRUE)
mydata$text <- gsub("[“”]", "\"", gsub("[‘’]", "'", mydata$text))
#' extract the quotes
stringi::stri_extract_all_regex(str = mydata$text, pattern = '(?<=").*?(?=")')
The output is:
[[1]]
[1] "We got him and he is healthy,"
[2] " said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, "
[3] "I adore tigers"
[[2]]
[1] "The target catalysed much greater conservation action, which was desperately needed,"
You can see that the second element of the first output is incorrect. I don't want to include
" said Houston Police Department (HPD) Major Offenders Commander Ron
Borza. He went on to say, "
Well, technically the second element of the first output is within quotes so the code is working correctly as per the pattern used. A quick fix would be to remove every 2nd entry from the list.
sapply(
stringi::stri_extract_all_regex(str = text, pattern = '(?<=").*?(?=")'),
`[`, c(TRUE, FALSE)
)
#[[1]]
#[1] "We got him and he is healthy," "I adore tigers"
#[[2]]
#[1] "The target catalysed much greater conservation action, which was desperately needed,"
We can do this with base R:
sapply(regmatches(text, gregexpr('(?<=")[^"]+', text, perl = TRUE)), function(x) x[c(TRUE, FALSE)])
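An alternative that avoids relying on alternating positions is to match the complete quoted spans, so each pair of quote characters is consumed, and strip the quotes afterwards. A sketch, assuming text holds the cleaned article text from the question:
# each match consumes its opening and closing quote, so the text between quotations is never matched
m <- regmatches(text, gregexpr('"[^"]*"', text))
lapply(m, function(x) substr(x, 2, nchar(x) - 1))  # drop the surrounding quote marks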

Map through a nested list using regex to remove entries of a character vector

I have a nested list (https://www.filehosting.org/file/details/841630/bribe.RData) which I want to transform into a tibble. In the list are some character vectors which differ in length from the other vectors. This is due to a web-scraping problem that could not be fixed in an earlier question.
Since all extra character strings in these vectors have a specific pattern, I wrote a regex to find and remove those. These extra strings only appear in the 5th character vector of each sublist. How can I map over all those sublists and vectors to get rid of the extra strings at once?
The mapping version should extract the following from every 5th character vector of each sublist.
MWE:
bribe.test <- stringr::str_remove(bribe.info[[1]][[5]],
  "(\\\r\\\n\\s*){3}.+(\\\r\\\n\\s*){2}")
How can I do this? Is transpose() in combination with simplify_all() an option? If so, how exactly do I use it in this case?
The final tibble should have the structure that all character vector x of each sublist create one column in the data.frame.
UPDATE :
To make things a little bit more accessible I will include the output of the list
bribe.info[[1]][[5]]
[1] "\r\n Hello sir, My uncle just coming india yesterday >night at ahmedabad airport from New Zealand. And i gave him 2 iphone , iphone 8 >plus and iphone 11 pro...Read more\r\n "
[2] "\r\n Date of the incident: 29th December 2019\nTime of >incident: Around 8 PM in the evening\nPlace of incident: ECR road, Pondicherry >to Tamil Nadu check pos...Read more\r\n "
[3] "\r\n Dear Sir,\n\nThis is not the first time I am >facing this issue with Rohit Gas Agency. I tried to bring it to the notice of >Indane. Its of no use. Rohit ...Read more\r\n "
[4] "\r\n \r\n \r\n >How to get a LPG gas connection\r\n \r\n "
[5] "\r\n I paid bribe today to a police officer who came >for passport verification of my mother. Even after providing all supporting >documents and required inf...Read more\r\n "
[6] "\r\n I have asked to pay bribe to avoid huge penalty >for putting tent sheet on car windows. Police asked me to pay 1100 rs fine or >pay bribe instead of tha...Read more\r\n "
[7] "\r\n Help desk officer prashant who are trapping people >to make work done by giving bribes to higher officials at malakpet rto malakpet >Hyderabad ...Read more\r\n "
[8] "\r\n Get free shipping when you buy the Revolution the >great american electric cigarette machine, within the continental US from >https://hardworkingproduct...Read more\r\n "
[9] "\r\n I Would like to Inform you that a lot of >corruption is going on in the DC Office Bangalore Urban Dept. I am not paid >bribe directly there is lot more...Read more\r\n "
[10] "\r\n Are you interested in selling one of your k1dney >for a good amount of 14Crore 7 cr Advance kindly Contact us now 9663960578 >.\n...Read more\r\n "
[11] "\r\n Are you interested in selling one of your k1dney >for a good amount of 14Crore 7 cr Advance kindly Contact us now, as we are >looking for k1dney donor, ...Read more\r\n
The regular expression needs to filter entry [4] of the vector, which works fine. But I have no idea how I can do this for bribe.info[[2]][[5]], bribe.info[[3]][[5]], and so on with map.
I found a way to do this with base R. So basically I hope anyone can provide a simple solution with purrr.
### remove text of embedded links, then transform the list into a tibble
briber <- list()
for (i in 1:130) {
  for (j in 1:5) {
    bribe.info[[i]][[j]] <- subset(bribe.info[[i]][[j]],
      !grepl("(\\\r\\\n\\s*){3}.+(\\\r\\\n\\s*){2}", bribe.info[[i]][[j]]))
  }
}
bribe.tibble <- bribe.info %>% map(~ as_tibble(.x, .name_repair = "unique")) %>% bind_rows()
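For anyone who still wants the purrr route, the nested cleaning can be expressed as a map within a map. A sketch, assuming the list shape above (sublists of character vectors) and the same pattern:
library(purrr)
library(dplyr)
pattern <- "(\\\r\\\n\\s*){3}.+(\\\r\\\n\\s*){2}"
# drop every entry matching the embedded-link pattern, in every vector of every sublist
bribe.clean <- map(bribe.info, ~ map(.x, function(v) v[!grepl(pattern, v)]))
bribe.tibble <- bribe.clean %>%
  map(~ as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows()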

counting sentences containing a specific key word in R

UPDATE
Here is what I have done so far.
library(tm)
library(NLP)
library(SnowballC)
# set directory
setwd("C:\\Users\\...\\Data pretest all TXT")
# create corpus with tm package
pretest <- Corpus(DirSource("\\Users\\...\\Data pretest all TXT"), readerControl = list(language = "en"))
pretest is a large SimpleCorpus with 36 elements.
My folder contains 36 txt files.
# check what went in
summary(pretest)
# create TDM
pretest.tdm <- TermDocumentMatrix(pretest, control = list(stopwords = TRUE,
tolower = TRUE, stemming = TRUE))
# convert corpus to data frame
dataframePT <- data.frame(text = unlist(sapply(pretest, `[`, "content")),
stringsAsFactors = FALSE)
dataframePT has 36 observations. So I think until here it is okay.
# load stringr library
library(stringr)
# define sentences
v = strsplit(dataframePT[,1], "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)
lapply(v, function(x) (stringr::str_count(x, "gain")))
My output looks like this
...
[[35]]
[1] NA
[[36]]
[1] NA
So there are actually 36 files, so that's good. But I don't know why it returns NA.
Thank you in advance for any suggestions.
library(NLP)
library(tm)
library(SnowballC)
Load data:
data("crude")
crude.tdm <- TermDocumentMatrix(crude, control = list(stopwords = TRUE, tolower = TRUE, stemming= TRUE))
First convert corpus to data frame
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = F)
one can also inspect the content: crude[[2]]$content
Now we need to define a sentence. Here I define it as an entity that has at least 10 A-Z or a-z characters, mixed with spaces and ",", and ending with ".". I split the documents by that rule using a look-behind on the ".":
z = strsplit(dataframe[,1], "(?<=[A-Za-z ,]{10})\\.", perl = T)
but this is not needed for the crude corpus, since every sentence ends with ".\n", so one can do:
z = strsplit(dataframe[,1], "\\.\n", perl = T)
I will stick with my previous definition of a sentence, since one wants it functioning not only for the crude corpus. The definition is not perfect, so I am keen on hearing your thoughts.
Let's check the output:
z[[2]]
[1] "OPEC may be forced to meet before a\nscheduled June session to readdress its production cutting\nagreement if the organization wants to halt the current slide\nin oil prices, oil industry analysts said"
[2] "\n \"The movement to higher oil prices was never to be as easy\nas OPEC thought"
[3] " They may need an emergency meeting to sort out\nthe problems,\" said Daniel Yergin, director of Cambridge Energy\nResearch Associates, CERA"
[4] "\n Analysts and oil industry sources said the problem OPEC\nfaces is excess oil supply in world oil markets"
[5] "\n \"OPEC's problem is not a price problem but a production\nissue and must be addressed in that way,\" said Paul Mlotok, oil\nanalyst with Salomon Brothers Inc"
[6] "\n He said the market's earlier optimism about OPEC and its\nability to keep production under control have given way to a\npessimistic outlook that the organization must address soon if\nit wishes to regain the initiative in oil prices"
[7] "\n But some other analysts were uncertain that even an\nemergency meeting would address the problem of OPEC production\nabove the 15.8 mln bpd quota set last December"
[8] "\n \"OPEC has to learn that in a buyers market you cannot have\ndeemed quotas, fixed prices and set differentials,\" said the\nregional manager for one of the major oil companies who spoke\non condition that he not be named"
[9] " \"The market is now trying to\nteach them that lesson again,\" he added.\n David T"
[10] " Mizrahi, editor of Mideast reports, expects OPEC\nto meet before June, although not immediately"
[11] " However, he is\nnot optimistic that OPEC can address its principal problems"
[12] "\n \"They will not meet now as they try to take advantage of the\nwinter demand to sell their oil, but in late March and April\nwhen demand slackens,\" Mizrahi said"
[13] "\n But Mizrahi said that OPEC is unlikely to do anything more\nthan reiterate its agreement to keep output at 15.8 mln bpd.\"\n Analysts said that the next two months will be critical for\nOPEC's ability to hold together prices and output"
[14] "\n \"OPEC must hold to its pact for the next six to eight weeks\nsince buyers will come back into the market then,\" said Dillard\nSpriggs of Petroleum Analysis Ltd in New York"
[15] "\n But Bijan Moussavar-Rahmani of Harvard University's Energy\nand Environment Policy Center said that the demand for OPEC oil\nhas been rising through the first quarter and this may have\nprompted excesses in its production"
[16] "\n \"Demand for their (OPEC) oil is clearly above 15.8 mln bpd\nand is probably closer to 17 mln bpd or higher now so what we\nare seeing characterized as cheating is OPEC meeting this\ndemand through current production,\" he told Reuters in a\ntelephone interview"
[17] "\n Reuter"
and the original:
cat(crude[[2]]$content)
OPEC may be forced to meet before a
scheduled June session to readdress its production cutting
agreement if the organization wants to halt the current slide
in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy
as OPEC thought. They may need an emergency meeting to sort out
the problems," said Daniel Yergin, director of Cambridge Energy
Research Associates, CERA.
Analysts and oil industry sources said the problem OPEC
faces is excess oil supply in world oil markets.
"OPEC's problem is not a price problem but a production
issue and must be addressed in that way," said Paul Mlotok, oil
analyst with Salomon Brothers Inc.
He said the market's earlier optimism about OPEC and its
ability to keep production under control have given way to a
pessimistic outlook that the organization must address soon if
it wishes to regain the initiative in oil prices.
But some other analysts were uncertain that even an
emergency meeting would address the problem of OPEC production
above the 15.8 mln bpd quota set last December.
"OPEC has to learn that in a buyers market you cannot have
deemed quotas, fixed prices and set differentials," said the
regional manager for one of the major oil companies who spoke
on condition that he not be named. "The market is now trying to
teach them that lesson again," he added.
David T. Mizrahi, editor of Mideast reports, expects OPEC
to meet before June, although not immediately. However, he is
not optimistic that OPEC can address its principal problems.
"They will not meet now as they try to take advantage of the
winter demand to sell their oil, but in late March and April
when demand slackens," Mizrahi said.
But Mizrahi said that OPEC is unlikely to do anything more
than reiterate its agreement to keep output at 15.8 mln bpd."
Analysts said that the next two months will be critical for
OPEC's ability to hold together prices and output.
"OPEC must hold to its pact for the next six to eight weeks
since buyers will come back into the market then," said Dillard
Spriggs of Petroleum Analysis Ltd in New York.
But Bijan Moussavar-Rahmani of Harvard University's Energy
and Environment Policy Center said that the demand for OPEC oil
has been rising through the first quarter and this may have
prompted excesses in its production.
"Demand for their (OPEC) oil is clearly above 15.8 mln bpd
and is probably closer to 17 mln bpd or higher now so what we
are seeing characterized as cheating is OPEC meeting this
demand through current production," he told Reuters in a
telephone interview.
Reuter
You can clean it a bit if you wish, removing the trailing \n, but it is not needed for your request.
Now you can do all sorts of things, like:
Which sentences contain the word "gain"
lapply(z, function(x) (grepl("gain", x)))
or the frequency of word "gain" per sentence:
lapply(z, function(x) (stringr::str_count(x, "gain")))
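And to answer the title question directly, summing the logical matches gives the number of sentences containing the key word in each document:
sapply(z, function(x) sum(grepl("gain", x)))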
Hi, I recommend using the filter function from the dplyr package together with grepl to search for a pattern inside a column:
library(dplyr)
pattern <- "word1|word2"
df <- df %>%
  filter(grepl(pattern, column_name))
df would then be limited to only the rows matching that condition, so just use nrow to count how many rows remain :)
Example:
a1<-1:10
a2<-11:20
(data<-data.frame(a1,a2,stringsAsFactors = F))
a1 a2
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
(data<-data %>% filter(grepl("5|7",data$a2)))
a1 a2
1 5 15
2 7 17
(nrow(data))
[1] 2

How do I parse the contents of hundreds of CSVs that are in a list of dataframes and split on ":" and ";" in loops?

I'm working with a large number (1,983) of CSV files. Posts on stackoverflow have said that lists are easier to work with so I've approached my task that way. I have read the CSVs in and gotten the first part of my task accomplished: what is the maximum number of concurrent users of the application? (A:203) Here's that code:
# get a list of the files
files <- list.files("my_path_here",pattern="*.CSV$", recursive = TRUE, full.names=TRUE)
#read in the csv's and store them as a list of dataframes
tables <- lapply(files, read.csv)
#store the counts of the number of users here
counts<-rep(NA,length(tables))
#loop thru the files to find the count and store that value
for (i in 1:length(files)) {
  counts[i] <- length(tables[[i]][[2]])
}
#what's the largest number?
max(counts)
#203
The 2nd part of the task is to show the count of each title for each file. The contents of each file will be something like this:
compute_0001 compute_0002
[1] 3/26/2015 6:00:00 Business System Manager;Lead CoPath Analyst
[2] Regional Histotechnologist;Hist Tech - Ht
[3] Regional Histotechnologist;Tissue Tech
[4] SDX Histotechnologist;Histology Tech
[5] SDX Histotechnologist;Histology Tech
[6] Regional Histotechnologist;Lab Asst II Histology
[7] CytoPrep Tech;Histo Tech - Ht
[8] Regional Histotechnologist;Tissue Tech
[9] Histology Supervisor;Supv Reg Lab Unit
[10] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT
What will differ from file to file is the time stamp in compute_0001, name of the file and the number of users (ie length of the file).
My approach was to try this:
col2 <- sapply(tables, summary, maxsum=300) # gives me a list of 1983 elements that is 23.6Mb
(I noticed that when doing a summary() on the files I would get something like this - which is why I was trying it)
col2[[1]]
compute_0001 compute_0002
[1] Business System Manager;Lead CoPath Analyst :1
[2] Regional Histotechnologist;Hist Tech - Ht :1
[3] Regional Histotechnologist;Tissue Tech :1
[4] SDX Histotechnologist;Histology Tech :1
[5] SDX Histotechnologist;Histology Tech :1
[6] Regional Histotechnologist;Lab Asst II Histology :2
[7] CytoPrep Tech;Histo Tech - Ht :4
[8] Regional Histotechnologist;Tissue Tech :1
[9] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT :1
The above is actually many different people. For my purposes, [2], [3], [6] and [8] are the same title, even though the stuff after the ";" is different (in truth, even [4] and [5] could also be considered the same as [2], [3], [6] and [8]).
That ":1" (or generally ":#") is the number of users with that title at that particular time. I was hoping to grab that character, make it numeric and add them up to get a count of the users with each title for each file. Each file is an observation at a particular datetime.
I tried something like this:
for (s in 1:length(col2)) {
  split <- strsplit(col2[[s]][,2], ":")
  # ... make it numeric so I can do addition with it
  num <- as.numeric(split[[s]][2])
  # ... and put it in the correct df
  tables[[s]]$count <- num
  # after dealing with the ":" I was going to handle splitting on the first ";"
}
But I couldn't get the loop to iterate more than a single time or past the first element of col2.
A more experienced useR suggested something like this:
strsplit(x = as.character(compute2[[s]]), split = ";", fixed = TRUE)
He said, "However this results in a messy list also, since there are multiple ";" in some lines. What I would suggest is to use grep() with a regex that returns the text before the first ";" - use that with sapply(compute2, grep()) and then you can run sapply(??, table) on the list that is returned to tally the job titles."
I'd prefer not to get into regex but, following his advice, I tried:
for (s in 1:length(tables)) {
  split <- strsplit(x = as.character(compute2[[s]]), split = ";", fixed = TRUE)
}
split is a list of only 122 elements, not nearly long enough, so it's not iterating through the loop either. So, I figured I'd skip the loop and try:
title_split <- sapply(compute2, strsplit, x = as.character(compute2[[1]]), split = ";", fixed = TRUE)
But that gave me more than 50 warnings and a matrix that had 105,000+ elements that was 20.2Mb in size.
Like I said, I'd prefer to not venture into the world of regex, since I think I should be able to split on the ":" first and then the first of the ";" and return the string that precedes the ";". I'm just not sure why the loop is failing.
What I eventually want is a table that shows the count of each title (collapsed for duplicates like [2],[3], [6] and [8] above) for each file (which represents an observation at a particular datetime). I'm pretty agnostic as to approach, so if I have to do it via regex, then so be it.
Sorry for the lengthy post but I suspect that part of my problem (besides being brand new to stackoverflow, R and not understanding regex well) is that I'm not well versed in list manipulation and I wanted you to have the context.
Many thanks for reading.
Your data isn't easily reproducible, so I've created a simple list of fake data that I hope captures the essence of your data.
Make a list of fake data frames:
string1 = "53 Regional histotechnologist;text2 - more text"
string2 = "54 Regional histotechnologist;text2 - more text"
string3 = "CytoPrep Tech;text2 - more text"
tables = list(df1=data.frame(compute=c(string1, string2, string3)),
df2=data.frame(compute=c(string1, string2, string3)))
Count the number of rows in each data frame:
counts = sapply(tables, nrow)
Add a column that extracts job titles from the compute column. The regex pattern skips zero or more digit characters ([0-9]*) followed by zero or one space character ( ?), then captures everything up to, but not including, the first semi-colon (([^;]*);), and then skips every character after the semi-colon (.*).
tables = sapply(names(tables), function(df) {
cbind(tables[[df]], title=gsub("[0-9]* ?([^;]*);.*", "\\1", tables[[df]][,"compute"]))
}, simplify=FALSE)
tables
$df1
compute title
1 53 Regional histotechnologist;text2 - more text Regional histotechnologist
2 54 Regional histotechnologist;text2 - more text Regional histotechnologist
3 CytoPrep Tech;text2 - more text CytoPrep Tech
$df2
compute title
1 53 Regional histotechnologist;text2 - more text Regional histotechnologist
2 54 Regional histotechnologist;text2 - more text Regional histotechnologist
3 CytoPrep Tech;text2 - more text CytoPrep Tech
Make a table of counts of each title for each data frame in tables:
title.table.list = lapply(tables, function(df) table(df$title))
title.table.list
$df1
CytoPrep Tech Regional histotechnologist
1 2
$df2
CytoPrep Tech Regional histotechnologist
1 2
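If one long table across all files is more convenient, the per-file tables can be stacked into a single data frame (a sketch):
title.counts <- do.call(rbind, lapply(names(title.table.list), function(nm) {
  data.frame(file  = nm,
             title = names(title.table.list[[nm]]),
             count = as.vector(title.table.list[[nm]]),
             stringsAsFactors = FALSE)
}))
title.counts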

How to create a column and replace value

Question
1
An artist impression of a star system is responsible for a nova. The team from university of VYU focus on a class of compounds. The young people was seen enjoying the football match.
2
Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful. Heart attacks more due to nurture than nature. SA footballer Senzo Meyiwa shot dead to save girlfriend
Expected output
1 An artist impression of a star system is responsible for a nova.
1 The team from university of VYU focus on a class of compounds.
1 The young people was seen enjoying the football match.
2 Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful.
2 Heart attacks more due to nurture than nature.
2 SA footballer Senzo Meyiwa shot dead to save girlfriend
The data is in CSV format and has around 1,000 data points; the numbers are in column 1 and the sentences are in column 2. I need to split the string and retain the row number for each particular sentence. I need your help to build the R code.
Note: Number and the sentence are two different columns
I have tried this code to split the strings, but I need code for the row index:
x$qwerty <- as.character(x$qwerty)
sa<-list(strsplit(x$qwerty,".",fixed=TRUE))[[1]]
s<-unlist(sa)
write.csv(s,"C:\\Users\\Suhas\\Desktop\\out23.csv")
One inconvenience of vectorization in R is that vectorized functions operate from "inside" the vector. That is, they operate on the elements themselves, rather than on the elements in the context of the vector. Therefore the user loses the innate ability to keep track of the index, i.e. where the element being operated on was located in the original object.
The workaround is to generate the index separately. This is easy to achieve with seq_along, which is an optimized version of 1:length(qwerty). Then you can just paste the index and the results together. In your case, you'll obviously want to do the pasting before you unlist.
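A minimal sketch of that idea, assuming x$qwerty is the sentence column from the question (the output path is shortened for the example):
sa <- strsplit(as.character(x$qwerty), ".", fixed = TRUE)
# pair each row index with its own sentences before flattening
s <- unlist(Map(function(i, v) paste(i, trimws(v)), seq_along(sa), sa))
write.csv(s, "out23.csv", row.names = FALSE)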
If your dataset is as shown above, maybe this helps. You can read from the file with readLines("file.txt"):
lines <- readLines(n=7)
1
An artist impression of a star system is responsible for a nova. The team from university of VYU focus on a class of compounds. The young people was seen enjoying the football match.
2
Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful. Heart attacks more due to nurture than nature. SA footballer Senzo Meyiwa shot dead to save girlfriend
lines1 <- lines[lines!='']
indx <- grep("^\\d", lines1)
lines2 <- unlist(strsplit(lines1, '(?<=\\.)(\\b| )', perl=TRUE))
indx <- grepl("^\\d+$", lines2)
res <- unlist(lapply(split(lines2,cumsum(indx)),
function(x) paste(x[1], x[-1])), use.names=FALSE)
res
#[1] "1 An artist impression of a star system is responsible for a nova."
#[2] "1 The team from university of VYU focus on a class of compounds."
#[3] "1 The young people was seen enjoying the football match."
#[4] "2 Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful."
#[5] "2 Heart attacks more due to nurture than nature."
#[6] "2 SA footballer Senzo Meyiwa shot dead to save girlfriend"
If you want it as 2 column data.frame
dat <- data.frame(id=rep(lines2[indx],diff(c(which(indx),
length(indx)+1))-1), Col1=lines2[!indx], stringsAsFactors=FALSE)
head(dat,2)
# id Col1
#1 1 An artist impression of a star system is responsible for a nova.
#2 1 The team from university of VYU focus on a class of compounds.
