Insert special character in R based on existing words - r

I was looking for a intuitive solution for a problem of mine.
I have a huge list of words, in which i have to insert a special character based on some criteria.
So if a two/three letter word appear in a cell i want to add "+" right and left to it
Example
global b2b banking would transform to global +b2b+ banking
how to finance commercial ale estate would transform to how +to+ finance commercial +ale+ estate
Here is sample data set:
sample <- c("commercial funding",
"global b2b banking"
"how to finance commercial ale estate"
"opening a commercial account",
"international currency account",
"miami imports banking",
"hsbc supply chain financing",
"international business expansion",
"grow business in Us banking",
"commercial trade Asia Pacific",
"business line of credits hsbc",
"Britain commercial banking",
"fx settlement hsbc",
"W Hotels")
data <- data.frame(sample)
Additionally is it possible to drop a row which has a character of length 1 ?
Example:
W Hotels
For all the one letter word i tried removing them with gsub,
gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample)
This should be removed from the data set set.
Any help is highly appreciated.
Edit 1
Thanks for the help, I added few more lines to it:
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)]
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample)
sample <- gsub(" ",",",sample)
sample <- gsub("+,","+",sample)
sample <- gsub(",+","+",sample)
sample <- tolower(sample)
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample)
data <- data.frame(sample)
data
sample
1 commercial++funding
2 global+++b2b+++banking
3 how++++to+++finance++commercial+++ale+++estate
4 international++currency++account
5 miami++imports++banking
6 hsbc++supply++chain++financing
7 international++business++expansion
8 grow++business+++in++++us+++banking
9 commercial++trade++asia++pacific
10 business++line+++of+++credits++hsbc
11 britain++commercial++banking
12 fx+++settlement++hsbc
Somehow i am unable to remove "+," with "," with gsub ? what am i doing wrong ?
so "fx+,settlement,hsbc" should be "fx+settlement,hsbc" but it is replacing , wth additional ++.

You need to do that in 2 steps: remove the items with 1-letter whole words, and then add + around 2-3 letter words.
Use
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)]
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)
data <- data.frame(sample)
data
See the R demo
The sample[!grepl("\\b[[:alnum:]]\\b",sample)] removes the items that contain word boundary (\b), a letter ([[:alnum:]]) and a word boundary pattern.
The gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) line replaces all 2-3-letter whole words with these words enclosed with +.
Result:
sample
1 commercial funding
2 global +b2b+ banking
3 +how+ +to+ finance commercial +ale+ estate
4 international currency account
5 miami imports banking
6 hsbc supply chain financing
7 international business expansion
8 grow business +in+ +Us+ banking
9 commercial trade Asia Pacific
10 business line +of+ credits hsbc
11 Britain commercial banking
12 +fx+ settlement hsbc
Note that W Hotels and opening a commercial account got filtered out.
Answer to the EDIT
You added some more replacement operations to the code, but you are using literal string replacements, thus, you just need to pass fixed=TRUE argument:
sample <- gsub(" ",",",sample, fixed=TRUE)
sample <- gsub("+,","+",sample, fixed=TRUE)
sample <- gsub(",+","+",sample, fixed=TRUE)
Else, the + is treated as a regex quantifier, and must be escaped to be treated as a literal plus symbol.
Also, if you need to remove all + from the start of the string, use
sample <- sub("^\\++", "", sample)

Related

Matching word list to individual words in strings using R

Edit: Fixed data example issue
Background/Data: I'm working on a merge between two datasets: one is a list of the legal names of various publicly traded companies and the second is a fairly dirty field with company names, individual titles, and all sorts of other difficult to predict words. The company name list is about 14,000 rows and the dirty data is about 1.3M rows. Not every publicly traded company will appear in the dirty data and some may appear multiple times with different presentations (Exxon Mobil, Exxon, ExxonMobil, etc.).
Accordingly, my current approach is to dismantle the publicly traded company name list into the individual words used in each title (after cleaning out some common words like company, corporation, inc, etc.), resulting in the data shown below as Have1. An example of some of the dirty data is shown below as Have2. I have also cleaned these strings to eliminate words like Inc and Company in my ongoing work, but in case anyone has a better idea than my current approach, I'm leaving the data as-is. Additionally, we can assume there are very few, if any, exact matches in the data and that the Have2 data is too noisy to successfully use a fuzzy match without additional work.
Question: What is the best way to go about determining which of the items in Have2 contains the words from Have1? Specifically, I think I need the final data to look like Want, so that I can then link the public company name to the dirty data name. The plan is to hand-verify the matches given the difficult of the Have2 data, but if anyone has any suggestions on another way to go about this, I am definitely open to suggestions (please, someone, have a suggestion haha).
Tried so far: I have code that sort of works, but takes ages to run and seems inefficient. That is:
library(data.table)
library(stringr)
company_name_data <- c("amazon inc", "apple inc", "radiation inc", "xerox inc", "notgoingtomatch inc")
have1 <- data.table(table(str_split(company_name_data, "\\W+", simplify = TRUE)))[!V1 == "inc"]
have2 <- c("ceo and director, apple inc",
"current title - senior manager amazon, inc., division of radiation exposure, subdivision of corporate anarchy",
"xerox inc., president and ceo",
"president and ceo of the amazon apple assn., division 4")
#Uses Have2 and creates a matrix where each column is a word and each row reflects one of the items from Have2
have3 <- data.table(str_split(have2, "\\W+", simplify = TRUE))
#Creates container
store <- data.table()
#Loops through each of the Have1 company names and sees whether that word appears in the have3 matrix
for (i in 1:nrow(have1)){
matches <- data.table(have2[sapply(1:nrow(have3), function(x) any(grepl(paste0("\\b",have1$V1[i],"\\b"), have3[x,])))])
if (nrow(matches) == 0){
next
}
#Create combo data
matches[, have1_word := have1$V1[i]]
#Storage
store <- rbind(store, matches)
}
Want
Name (from Have2)
Word (from Have1)
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy
amazon
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy
radiation
vp and general bird aficionado of the amazon apple assn. branch F
amazon
vp and general bird aficionado of the amazon apple assn. branch F
apple
ceo and director, apple inc
apple
xerox inc., president and ceo
xerox
Have1
Word
N
amazon
1
apple
3
xerox
1
notgoingtomatch
2
radiation
1
Have2
Name
ceo and director, apple inc
current title - senior manager amazon, inc., division of microwaves and radiation exposure, subdivision of corporate anarchy
xerox inc., president and ceo
vp and general bird aficionado of the amazon apple assn. branch F
Using what you have documented, in terms of data from company_name_data and have2 only:
library(tidytext)
library(tidyverse)
#------------ remove stop words before tokenization ---------------
# now split each phrase, remove the stop words, rejoin the phrases
# this works through one row at a time** (this is vectorization)
comp2 <- unlist(lapply(company_name_data, # split the phrases into individual words,
# remove stop words then reassemble phrases
function(x) {
paste(unlist(strsplit(x,
" ")
)[!(unlist(strsplit(x,
" ")) %in% (stop_words$word %>%
unlist())
) # end 2nd unlist
], # end subscript of string split
collapse=" ")})) # reassemble string
haveItAll <- data.frame(have2)
haveItAll$comp <- unlist(lapply(have2,
function(x){
paste(unlist(strsplit(x,
" ")
)[(unlist(strsplit(x,
" ")) %in% comp2
) # end 2nd unlist
], # end subscript of string split
collapse=" ")})) # reassemble string
The results in the second column, based on the text analysis are "apple," "radiation," "xerox," and "amazon apple."
I'm certain this code isn't mine originally. I'm sure I got these ideas from somewhere on StackOverflow...

Extract specific lines of text in r

I have a .txt file with thousands of lines. In this file, I have a meta information about research articles. Every paper has information about Published year (PY), Title (TI), DOI number (DI), Publishing Type (PT) and Abstract (AB). So, the information of almost 300 papers exist in the text file. The format of information about first two article is as follows.
PT J
AU Filieri, Raffaele
Acikgoz, Fulya
Ndou, Valentina
Dwivedi, Yogesh
TI Is TripAdvisor still relevant? The influence of review credibility,
review usefulness, and ease of use on consumers' continuance intention
SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
DI 10.1108/IJCHM-05-2020-0402
EA NOV 2020
PY 2020
AB Purpose - Recent figures show that users are discontinuing their usage
of TripAdvisor, the leading user-generated content (UGC) platform in the
tourism sector. Hence, it is relevant to study the factors that
influence travelers' continued use of TripAdvisor.
Design/methodology/approach - The authors have integrated constructs
from the technology acceptance model, information systems (IS)
continuance model and electronic word of mouth literature. They used
PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
users of TripAdvisor recruited through Prolific.
Findings - Findings reveal that perceived ease of use, online consumer
review (OCR) credibility and OCR usefulness have a positive impact on
customer satisfaction, which ultimately leads to continuance intention
of UGC platforms. Customer satisfaction mediates the effect of the
independent variables on continuance intention.
Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
can benefit from the findings of this study. Specifically, they should
improve the ease of use of their platforms by facilitating travelers'
information searches. Moreover, they should use signals to make credible
and helpful content stand out from the crowd of reviews.
Originality/value - This is the first study that adopts the IS
continuance model in the travel and tourism literature to research the
factors influencing consumers' continued use of travel-based UGC
platforms. Moreover, the authors have extended this model by including
new constructs that are particularly relevant to UGC platforms, such as
performance heuristics and OCR credibility.
ZR 0
ZA 0
Z8 0
ZS 0
TC 0
ZB 0
Z9 0
SN 0959-6119
EI 1757-1049
UT WOS:000592516500001
ER
PT J
AU Li, Yelin
Bu, Hui
Li, Jiahong
Wu, Junjie
TI The role of text-extracted investor sentiment in Chinese stock price
prediction with the enhancement of deep learning
SO INTERNATIONAL JOURNAL OF FORECASTING
VL 36
IS 4
BP 1541
EP 1562
DI 10.1016/j.ijforecast.2020.05.001
PD OCT-DEC 2020
PY 2020
AB Whether investor sentiment affects stock prices is an issue of
long-standing interest for economists. We conduct a comprehensive study
of the predictability of investor sentiment, which is measured directly
by extracting expectations from online user-generated content (UGC) on
the stock message board of Eastmoney.com in the Chinese stock market. We
consider the influential factors in prediction, including the selections
of different text classification algorithms, price forecasting models,
time horizons, and information update schemes. Using comparisons of the
long short-term memory (LSTM) model, logistic regression, support vector
machine, and Naive Bayes model, the results show that daily investor
sentiment contains predictive information only for open prices, while
the hourly sentiment has two hours of leading predictability for closing
prices. Investors do update their expectations during trading hours.
Moreover, our results reveal that advanced models, such as LSTM, can
provide more predictive power with investor sentiment only if the inputs
of a model contain predictive information. (C) 2020 International
Institute of Forecasters. Published by Elsevier B.V. All rights
reserved.
CT 14th International Conference on Services Systems and Services
Management (ICSSSM)
CY JUN 16-18, 2017
CL Dongbei Univ Finance & Econ, Sch Management Sci & Engn, Dalian, PEOPLES
R CHINA
HO Dongbei Univ Finance & Econ, Sch Management Sci & Engn
SP Tsinghua Univ; Chinese Univ Hong Kong; IEEE Syst Man & Cybernet Soc
ZA 0
TC 0
ZB 0
ZS 0
Z8 0
ZR 0
Z9 0
SN 0169-2070
EI 1872-8200
UT WOS:000570797300025
ER
Now, I want to extract the abstract of each article and store it in the data frame. To extract the abstract I have the following code, which gives me the first match of abstract.
f = readLines("sample.txt")
#extract first match....
pattern <- "AB\\s*(.*?)\\s*ZR"
result <- regmatches(as.String(f), regexec(pattern, as.String(f)))
result[[1]][2]
[1] "Purpose - Recent figures show that users are discontinuing their usage\n of TripAdvisor, the leading user-generated content (UGC) platform in the\n tourism sector. Hence, it is relevant to study the factors that\n influence travelers' continued use of TripAdvisor.\n Design/methodology/approach - The authors have integrated constructs\n from the technology acceptance model, information systems (IS)\n continuance model and electronic word of mouth literature. They used\n PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297\n users of TripAdvisor recruited through Prolific.\n Findings - Findings reveal that perceived ease of use, online consumer\n review (OCR) credibility and OCR usefulness have a positive impact on\n customer satisfaction, which ultimately leads to continuance intention\n of UGC platforms. Customer satisfaction mediates the effect of the\n independent variables on continuance intention.\n Practical implications - Managers of UGC platforms (i.e. TripAdvisor)\n can benefit from the findings of this study. Specifically, they should\n improve the ease of use of their platforms by facilitating travelers'\n information searches. Moreover, they should use signals to make credible\n and helpful content stand out from the crowd of reviews.\n Originality/value - This is the first study that adopts the IS\n continuance model in the travel and tourism literature to research the\n factors influencing consumers' continued use of travel-based UGC\n platforms. Moreover, the authors have extended this model by including\n new constructs that are particularly relevant to UGC platforms, such as\n performance heuristics and OCR credibility."
The problem is, I want to extract all the abstracts but the pattern would be different for most of the abstracts. So the specific pattern for all the abstract is that I should extract text starting from AB and every next line having space in the front. Any body can help me in this regard?
You can first group the lines: whenever a line does not start with a space character the group counter is moved up by one.
Then you can aggregate f by group and select the abstracts from the aggregated vector:
group <- cumsum(!grepl("^ ", f))
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]
f2[grepl("^AB ", f2)]
A completely different approach. If your text file has the layout you are showing, you could also read everything in a data.frame with readr::read_fwf. When doing this you have all the info from the articles available. You could use tidyr::fill to fill out the missing meta info.
library(dplyr)
library(readr)
articles <- read_fwf("tests/SO text.txt", fwf_empty("tests/SO text.txt", col_names = c("mi", "text")))
articles <- articles %>%
filter(!(is.na(mi) & is.na(text))) # removes empty lines between articles.
articles
# A tibble: 98 x 2
mi text
<chr> <chr>
1 PT J
2 AU Filieri, Raffaele
3 NA Acikgoz, Fulya
4 NA Ndou, Valentina
5 NA Dwivedi, Yogesh
6 TI Is TripAdvisor still relevant? The influence of review credibility,
7 NA review usefulness, and ease of use on consumers' continuance intention
8 SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
9 DI 10.1108/IJCHM-05-2020-0402
10 EA NOV 2020
# ... with 88 more rows
Try it with this regex:
^AB (?:(?!^[A-Z]{2} )([\s\S]))*
PCRE Demo (requires perl=TRUE in R)
If you want to drop the prefix add \K after ^AB \K
You can use
(?m)^AB\h+\K.*(?:\R\h.+)*
See the regex demo. Details:
(?m) - a multiline flag making ^ match at the start of each line
^ - start of a line
AB - an AB substring
\h+ - one or more horizontal whitespaces
\K - match reset operator discard the text matched so far
.* - the rest of the line
(?:\R\h.+)* - zero or more consecutive lines that start with a horizontal whitespace.
In R, you may use it like
x <- as.String(f)
regmatches(x, gregexpr("(?m)^AB\\h+\\K.*(?:\\R\\h.+)*", x, perl=TRUE))

Extract the paragraphs from a PDF that contain a keyword using R

I need to extract from a pdf file the paragraphs that contain a keyword. Tried various codes but none got anything.
I have seen this code from a user #Tyler Rinker (Extract before and after lines based on keyword in Pdf using R programming) but it extracts the line where the keyword is, the before and after.
library(textreadr)
library(tidyverse)
loc <- function(var, regex, n = 1, ignore.case = TRUE){
locs <- grep(regex, var, ignore.case = ignore.case)
out <- sort(unique(c(locs - 1, locs, locs + 1)))
out <- out[out > 0]
out[out <= length(var)]
}
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
read_pdf() %>%
slice(loc(text, 'cancer'))
However, I need to get the paragraphs and store each one in a row in my database. Could you help me?
The text lines in paragraphs will all be quite long unless it is the final line of the paragraph. We can count the characters in each line and do a histogram to show this:
library(textreadr)
doc <- read_pdf('https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf')
hist(nchar(doc$text), 20)
So anything less than about 75 characters is either not in a paragraph or at the end of a paragraph. We can therefore stick a line break on the short ones, paste all the lines together, then split on linebreaks:
doc$text[nchar(doc$text) < 75] <- paste0(doc$text[nchar(doc$text) < 75], "\n")
txt <- paste(doc$text, collapse = " ")
txt <- strsplit(txt, "\n")[[1]]
So now we can just do our regex and find the paragraphs with the key word:
grep("cancer", txt, value = TRUE)
#> [1] " Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but stresses that, in order for them to work, they should be voluntary, and the government should exempt all life-saving drugs from import duties and other taxes such as excise duty and VAT. He is, however, critical about a proposal for mandatory price negotiation of newly patented drugs. He feels this will erode India's credibility in implementing the Patent Act in © 2006 KPMG International. KPMG International is a Swiss cooperative that serves as a coordinating entity for a network of independent firms operating under the KPMG name. KPMG International provides no services to clients. Each member firm of KPMG International is a legally distinct and separate entity and each describes itself as such. All rights reserved. Collaboration for Growth 24"
#> [2] " a fair and transparent manner. To deal with diabetes, medicines are not the only answer; awareness about the need for lifestyle changes needs to be increased, he adds. While industry leaders have long called for the development of PPPs for the provision of health care in India, particularly in rural areas, such initiatives are currently totally unexplored. However, the government's 2006 draft National Pharmaceuticals Policy proposes the introduction of PPPs with drug manufacturers and hospitals as a way of vastly increasing the availability of medicines to treat life-threatening diseases. It notes, for example, that while an average estimate of the value of drugs to treat the country's cancer patients is $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the near non-accessibility of the medicines to a vast majority of the affected population, mainly because of the high cost of these medicines,” says the Policy, which also calls for tax and excise exemptions for anti-cancer drugs."
#> [3] " 50.1 percent of Aventis Pharma is held by European drug major Sanofi-Aventis and, in early April 2006, it was reported that UB Holdings had sold its 10 percent holding in the firm to Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective, anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1 million, with domestic sales up 9.1 percent at $129.8 million and exports increasing 12 percent to $51.2 million. Sales were led by 83 percent annual growth for the diabetes treatment Lantus (insulin glargine), followed by the rabies vaccine Rabipur (+22 percent), the diabetes drug Amaryl (glimepiride) and epilepsy treatment Frisium (clobazam), both up 18 percent, the angiotensin-coverting enzyme inhibitor Cardace (ramipril +15 percent), Clexane (enoxaparin), an anticoagulant, growing 14 percent and Targocid (teicoplanin), an antibiotic, whose sales advanced 8 percent."
Created on 2020-09-16 by the reprex package (v0.3.0)

counting sentences containing a specific key word in R

UPDATE
Here is what I have done so far.
library(tm)
library(NLP)
library(SnowballC)
# set directory
setwd("C:\\Users\\...\\Data pretest all TXT")
# create corpus with tm package
pretest <- Corpus(DirSource("\\Users\\...\\Data pretest all TXT"), readerControl = list(language = "en"))
pretest is a large SimpleCorpus with 36 elements.
My folder contains 36 txt files.
# check what went in
summary(pretest)
# create TDM
pretest.tdm <- TermDocumentMatrix(pretest, control = list(stopwords = TRUE,
tolower = TRUE, stemming = TRUE))
# convert corpus to data frame
dataframePT <- data.frame(text = unlist(sapply(pretest, `[`, "content")),
stringsAsFactors = FALSE)
dataframePT has 36 observations. So I think until here it is okay.
# load stringr library
library(stringr)
# define sentences
v = strsplit(dataframePT[,1], "(?<=[A-Za-z ,]{10})\\.", perl = TRUE)
lapply(v, function(x) (stringr::str_count(x, "gain")))
My output looks like this
...
[[35]]
[1] NA
[[36]]
[1] NA
So there are actually 36 files, so that's good. But I don't know why it returns NA.
Thank you in advance for any suggestions.
library(NLP)
library(tm)
library(SnowballC)
Load data:
data("crude")
crude.tdm <- TermDocumentMatrix(crude, control = list(stopwords = TRUE, tolower = TRUE, stemming= TRUE))
First convert corpus to data frame
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = F)
one can also inspect the content: crude[[2]]$content
now we need to define a sentence - here I define it with an entity that has at least 10 A-Z or a-z characters mixed with spaces and "," and ending with ".". And I split the documents by that rule using look-behind the .
z = strsplit(dataframe[,1], "(?<=[A-Za-z ,]{10})\\.", perl = T)
but this is not needed for crude corpus since every sentence ends with .\n so one can do:
z = strsplit(dataframe[,1], "\\.n\", perl = T)
I will stick with my previous definition of sentence since one wants it functioning not only for crude corpus. The definition is not perfect so I am keen on hearing your thoughts?
Lets check the output
z[[2]]
[1] "OPEC may be forced to meet before a\nscheduled June session to readdress its production cutting\nagreement if the organization wants to halt the current slide\nin oil prices, oil industry analysts said"
[2] "\n \"The movement to higher oil prices was never to be as easy\nas OPEC thought"
[3] " They may need an emergency meeting to sort out\nthe problems,\" said Daniel Yergin, director of Cambridge Energy\nResearch Associates, CERA"
[4] "\n Analysts and oil industry sources said the problem OPEC\nfaces is excess oil supply in world oil markets"
[5] "\n \"OPEC's problem is not a price problem but a production\nissue and must be addressed in that way,\" said Paul Mlotok, oil\nanalyst with Salomon Brothers Inc"
[6] "\n He said the market's earlier optimism about OPEC and its\nability to keep production under control have given way to a\npessimistic outlook that the organization must address soon if\nit wishes to regain the initiative in oil prices"
[7] "\n But some other analysts were uncertain that even an\nemergency meeting would address the problem of OPEC production\nabove the 15.8 mln bpd quota set last December"
[8] "\n \"OPEC has to learn that in a buyers market you cannot have\ndeemed quotas, fixed prices and set differentials,\" said the\nregional manager for one of the major oil companies who spoke\non condition that he not be named"
[9] " \"The market is now trying to\nteach them that lesson again,\" he added.\n David T"
[10] " Mizrahi, editor of Mideast reports, expects OPEC\nto meet before June, although not immediately"
[11] " However, he is\nnot optimistic that OPEC can address its principal problems"
[12] "\n \"They will not meet now as they try to take advantage of the\nwinter demand to sell their oil, but in late March and April\nwhen demand slackens,\" Mizrahi said"
[13] "\n But Mizrahi said that OPEC is unlikely to do anything more\nthan reiterate its agreement to keep output at 15.8 mln bpd.\"\n Analysts said that the next two months will be critical for\nOPEC's ability to hold together prices and output"
[14] "\n \"OPEC must hold to its pact for the next six to eight weeks\nsince buyers will come back into the market then,\" said Dillard\nSpriggs of Petroleum Analysis Ltd in New York"
[15] "\n But Bijan Moussavar-Rahmani of Harvard University's Energy\nand Environment Policy Center said that the demand for OPEC oil\nhas been rising through the first quarter and this may have\nprompted excesses in its production"
[16] "\n \"Demand for their (OPEC) oil is clearly above 15.8 mln bpd\nand is probably closer to 17 mln bpd or higher now so what we\nare seeing characterized as cheating is OPEC meeting this\ndemand through current production,\" he told Reuters in a\ntelephone interview"
[17] "\n Reuter"
and the original:
cat(crude[[2]]$content)
OPEC may be forced to meet before a
scheduled June session to readdress its production cutting
agreement if the organization wants to halt the current slide
in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy
as OPEC thought. They may need an emergency meeting to sort out
the problems," said Daniel Yergin, director of Cambridge Energy
Research Associates, CERA.
Analysts and oil industry sources said the problem OPEC
faces is excess oil supply in world oil markets.
"OPEC's problem is not a price problem but a production
issue and must be addressed in that way," said Paul Mlotok, oil
analyst with Salomon Brothers Inc.
He said the market's earlier optimism about OPEC and its
ability to keep production under control have given way to a
pessimistic outlook that the organization must address soon if
it wishes to regain the initiative in oil prices.
But some other analysts were uncertain that even an
emergency meeting would address the problem of OPEC production
above the 15.8 mln bpd quota set last December.
"OPEC has to learn that in a buyers market you cannot have
deemed quotas, fixed prices and set differentials," said the
regional manager for one of the major oil companies who spoke
on condition that he not be named. "The market is now trying to
teach them that lesson again," he added.
David T. Mizrahi, editor of Mideast reports, expects OPEC
to meet before June, although not immediately. However, he is
not optimistic that OPEC can address its principal problems.
"They will not meet now as they try to take advantage of the
winter demand to sell their oil, but in late March and April
when demand slackens," Mizrahi said.
But Mizrahi said that OPEC is unlikely to do anything more
than reiterate its agreement to keep output at 15.8 mln bpd."
Analysts said that the next two months will be critical for
OPEC's ability to hold together prices and output.
"OPEC must hold to its pact for the next six to eight weeks
since buyers will come back into the market then," said Dillard
Spriggs of Petroleum Analysis Ltd in New York.
But Bijan Moussavar-Rahmani of Harvard University's Energy
and Environment Policy Center said that the demand for OPEC oil
has been rising through the first quarter and this may have
prompted excesses in its production.
"Demand for their (OPEC) oil is clearly above 15.8 mln bpd
and is probably closer to 17 mln bpd or higher now so what we
are seeing characterized as cheating is OPEC meeting this
demand through current production," he told Reuters in a
telephone interview.
Reuter
You can clean it a bit if you wish, removing the trailing \n but it is not needed for your request.
Now you can do all sorts of things, like:
Which sentences contain the word "gain"
lapply(z, function(x) (grepl("gain", x)))
or the frequency of word "gain" per sentence:
lapply(z, function(x) (stringr::str_count(x, "gain")))
Hi I recommend using filter function from dplyr package and grepl function to search a pattern inside
pattern <- "word1|word2"
df<- df %>%
filter(grepl(pattern,column_name)
The df would be limited to only those matching that condition. So then just use nrow function to count how many rows last :)
Example:
a1<-1:10
a2<-11:20
(data<-data.frame(a1,a2,stringsAsFactors = F))
a1 a2
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
(data<-data %>% filter(grepl("5|7",data$a2)))
a1 a2
1 5 15
2 7 17
(nrow(data))
[1] 2

error reading text file into new columns of a dataframe using some text editing

I have a text file (0001.txt) which contains the data as below:
<DOC>
<DOCNO>1100101_business_story_11931012.utf8</DOCNO>
<TEXT>
The Telegraph - Calcutta (Kolkata) | Business | Local firms go global
6 Local firms go global
JAYANTA ROY CHOWDHURY
New Delhi, Dec. 31: Indian companies are stepping out of their homes to try their luck on foreign shores.
Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09.
Though the first-quarter investment was 15 per cent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
According to analysts, confidence in global recovery, cheap corporate buys abroad and easier rules governing investment overseas had spurred flow of capital and could see total investment abroad top $12 billion this year and rise to $18-20 billion next fiscal.
For example, Titagarh Wagons plans to expand abroad on the back of the proposed Asian railroad project.
We plan to travel all around the world with the growth of the railroads, said Umesh Chowdhury of Titagarh Wagons.
India is full of opportunities, but we are all also looking at picks abroad, said Gautam Mitra, managing director of Indian Structurals Engineering Company.
Mitra plans to open a holding company in Switzerland to take his business in structurals to other Asian and African countries.
Indian companies created 3 lakh jobs in the US, while contributing $105 billion to the US economy between 2004 and 2007, according to commerce ministry statistics. During 2008-09, Singapore, the Netherlands, Cyprus, the UK, the US and Mauritius together accounted for 81 per cent of the total outward investment.
Bose said, And not all of it is organic growth. Much of our investment abroad reflects takeovers and acquisitions.
In the last two years, Suzlon acquired Portugals Martifers stake in German REpower Systems for $122 million. McNally Bharat Engineering has bought the coal and minerals processing business of KHD Humboldt Wedag. ONGC bought out Imperial Energy for $2 billion.
Indias foreign assets and liabilities today add up to more than 60 per cent of its gross domestic product. By the end of 2008-09, total foreign investment was $67 billion, more than double of that at the end of March 2007.
</TEXT>
</DOC>
Above, all text data is within the HTML code for text i.e.
<TEXT> and </TEXT>.
I want to read it into an R dataframe in a way that there will be four columns and the data should be read as:
Title Author Date Text
The Telegraph - Calcutta (Kolkata) JAYANTA ROY CHOWDHURY Dec. 31 Indian companies are stepping out of their homes to try their luck on foreign shores. Corporate India invested $2.7 billion abroad in the first quarter of 2009-2010 on top of $15.9 billion in 2008-09. Though the first-quarter investment was 15 percent lower than what was invested in the same period last year, merchant banker Sudipto Bose said, It marks a confidence in a new world order where Indian businesses see themselves as equal to global players.
What I was trying to read using dplyr and as shown below:
# read text file
library(dplyr)
library(readr)
dat <- read_csv("0001.txt") %>% slice(-8)
# print part of data frame
head(dat, n=2)
In above code, I tried to skip first few lines (which are not important) from the text file that contains the above text and then read it into dataframe.
But I could not get what I was looking for and got confused what I am doing is wrong.
Could someone please help?
To be able to read data into R as a data frame or table, the data needs to have a consistent structure maintained by separators. One of the most common formats is a file with comma separated values (CSV).
The data you're working with doesn't have separators though. It's essentially a string with minimally enforced structure. Because of this, it sounds like the question is more related to regular expressions (regex) and data mining than it is to reading text files into R. So I'd recommend looking into those two things if you do this task often.
That aside, to do what you're wanting in this example, I'd recommend reading the text file into R as a single string of text first. Then you can parse the data you want using regex. Here's a basic, rough draft of how to do that:
fileName <- "Path/to/your/data/0001.txt"
string <- readChar(fileName, file.info(fileName)$size)
df <- data.frame(
Title=sub("\\s+[|]+(.*)","",string),
Author=gsub("(.*)+?([A-Z]{2,}.*[A-Z]{2,})+(.*)","\\2",string),
Date=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+(.*)","\\2",string),
Text=gsub("(.*)+([A-Z]{1}[a-z]{2}\\.\\s[0-9]{1,2})+[: ]+(.*)","\\3",string))
Output:
str(df)
'data.frame': 1 obs. of 4 variables:
$ Title : chr "The Telegraph - Calcutta (Kolkata)"
$ Author: chr "JAYANTA ROY CHOWDHURY"
$ Date : chr "Dec. 31"
$ Text : chr "Indian companies are stepping out of their homes to"| __truncated__
The reason why regex can be useful is that it allows for very specific patterns in strings. The downside is when you're working with strings that keep changing formats. That will likely mean some slight adjustments to the regex used.
read.table( file = ... , sep = "|") will solve your issue.

Resources