I am struggling to replace stringA in a column Description with stringB if stringA contains stringC - r

id,Description
1,DRIVING VEHICLE ON HIGHWAY WITHOUT CURRENT REGISTRATION PLATES AND VALIDATION TABS
2,EXCEEDING SPEED LIMIT IN SCHOOL ZONE WITH 35 MPH IN 25 ZONE
3,DRIVING VEHICLE IN EXCESS OF REASONABLE AND PRUDENT SPEED ON HIGHWAY 8855
I want to replace the whole Description value with "EXCESS SPEED" if it contains "SPEED".
I am using the gsub function:
training_data$Description = gsub(".*SPEED.*", "EXCESS SPEED", training_data$Description, fixed = TRUE)
It is still not working; I am getting this result (it just replaces "SPEED" with "EXCESS SPEED" instead of replacing the entire value):
Output Received:
id,Description
1,DRIVING VEHICLE ON HIGHWAY WITHOUT CURRENT REGISTRATION PLATES AND VALIDATION TABS
2,EXCEEDING EXCESS SPEED LIMIT IN SCHOOL ZONE WITH 35 MPH IN 25 ZONE
3,DRIVING VEHICLE IN EXCESS OF REASONABLE AND PRUDENT EXCESS SPEED ON HIGHWAY 8855
Expected Output:
id,Description
1,DRIVING VEHICLE ON HIGHWAY WITHOUT CURRENT REGISTRATION PLATES AND VALIDATION TABS
2,EXCESS SPEED
3,EXCESS SPEED

You can try this:
training_data$Description = sapply(training_data[, 2], function(sentence) {
  if (grepl("SPEED", sentence)) {
    "EXCESS SPEED"
  } else {
    as.character(sentence)
  }
})
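Note that the original gsub call fails because fixed = TRUE tells gsub to treat the pattern as a literal string, so .* is never interpreted as a regular expression; dropping fixed = TRUE makes gsub(".*SPEED.*", "EXCESS SPEED", training_data$Description) behave as intended. A vectorized alternative (a small sketch, assuming Description is a character column rather than a factor) avoids the sapply loop entirely:
# Replace the whole value wherever it contains "SPEED"; leave other rows untouched
training_data$Description <- ifelse(grepl("SPEED", training_data$Description),
                                    "EXCESS SPEED",
                                    training_data$Description)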

Related

How to find duplicates using Lat-Long data and make it a Unique Identifier in a big dataset

My dataset looks something like this. Note: below is a hypothetical dataset.
Objective: A sales employee has to go to a particular location to verify the houses/stores/buildings, and a device captures the information below.
Sr.No.  Store_Name    Phone-No.  Agent_id  Area            Lat-Long
1       ABC Stores    89099090   121       Bay Area        23.909090,89.878798
2       Wuhan Masks   45453434   122       Santa Fe        24.452134,78.123243
3       Twitter Cafe  67556090   123       Middle East     11.889766,23.334483
4       abc           33445569   121       Santa Cruz      23.345678,89.234213
5       Silver Gym    11004110   234       Worli Sea Link  56.564311,78.909087
6       CK Clothings  00908876   223       90th Street     34.445887,12.887654
Facts:
#1 Unique identifier for finding duplicates - check Sr.No. 1 & 4: they are basically the same store.
In this dummy dataset all the columns can be manipulated, i.e. for the same store/house/building-outlet:
a) Since the name is entered manually, the same house/store can be entered in the system under different names -
multiple visits can happen
b) The mobile number can also be manipulated; a different number can be associated with the same outlet
c) The device the agent uses to capture the lat-long can also be fudged - by moving closer to or near the building
Problem:
How to make the Lat-Long data the unique identifier for finding duplicates in the huge dataset, keeping point c) above in mind.
Deploying QR codes is not very helpful either, as these can also be tweaked.
The goal is to stop the fraudulent practice of an employee (the same employee can visit the same store/outlet again, or a different employee can visit the same store/outlet, just to increase the visit count).
Right now I can only think of the Lat-Long column to make a UID; please feel free to suggest anything else that could be used.
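No answer is included here, but a common approach is to treat two records as the same location when their coordinates fall within a small distance tolerance, rather than requiring an exact Lat-Long match. Below is a minimal sketch (the stores data frame, its lat/lon values, and the 100 m tolerance are all hypothetical) that uses geosphere::distHaversine to compute pairwise distances and groups nearby points into one location ID via single-linkage clustering:
library(geosphere)

# Hypothetical example: Sr.No. 1 and 4 captured a few metres apart
stores <- data.frame(
  sr_no = c(1, 2, 4),
  lat   = c(23.909090, 24.452134, 23.908995),
  lon   = c(89.878798, 78.123243, 89.878701)
)

# Pairwise great-circle distances in metres (geosphere expects lon, lat order)
dist_m <- distm(as.matrix(stores[, c("lon", "lat")]), fun = distHaversine)

# Points closer than 100 m get the same cluster number, which can serve as the UID
stores$location_uid <- cutree(hclust(as.dist(dist_m), method = "single"), h = 100)
One caveat: any fixed tolerance trades off false merges (two genuinely different stores in the same building) against missed duplicates, so the threshold should be tuned to how far an agent could plausibly stand from the outlet.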

Extract the paragraphs from a PDF that contain a keyword using R

I need to extract from a PDF file the paragraphs that contain a keyword. I tried various approaches but none of them got anything.
I have seen this code from user @Tyler Rinker (Extract before and after lines based on keyword in Pdf using R programming), but it extracts the line where the keyword is, plus the lines before and after it.
library(textreadr)
library(tidyverse)
loc <- function(var, regex, n = 1, ignore.case = TRUE){
  locs <- grep(regex, var, ignore.case = ignore.case)
  out <- sort(unique(c(locs - 1, locs, locs + 1)))
  out <- out[out > 0]
  out[out <= length(var)]
}
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
  read_pdf() %>%
  slice(loc(text, 'cancer'))
However, I need to get the paragraphs and store each one in a row in my database. Could you help me?
Text lines inside a paragraph will all be quite long, unless it is the final line of the paragraph. We can count the characters in each line and plot a histogram to show this:
library(textreadr)
doc <- read_pdf('https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf')
hist(nchar(doc$text), 20)
So anything less than about 75 characters is either not in a paragraph or at the end of a paragraph. We can therefore stick a line break on the short ones, paste all the lines together, then split on linebreaks:
doc$text[nchar(doc$text) < 75] <- paste0(doc$text[nchar(doc$text) < 75], "\n")
txt <- paste(doc$text, collapse = " ")
txt <- strsplit(txt, "\n")[[1]]
So now we can just do our regex and find the paragraphs with the key word:
grep("cancer", txt, value = TRUE)
#> [1] " Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but stresses that, in order for them to work, they should be voluntary, and the government should exempt all life-saving drugs from import duties and other taxes such as excise duty and VAT. He is, however, critical about a proposal for mandatory price negotiation of newly patented drugs. He feels this will erode India's credibility in implementing the Patent Act in © 2006 KPMG International. KPMG International is a Swiss cooperative that serves as a coordinating entity for a network of independent firms operating under the KPMG name. KPMG International provides no services to clients. Each member firm of KPMG International is a legally distinct and separate entity and each describes itself as such. All rights reserved. Collaboration for Growth 24"
#> [2] " a fair and transparent manner. To deal with diabetes, medicines are not the only answer; awareness about the need for lifestyle changes needs to be increased, he adds. While industry leaders have long called for the development of PPPs for the provision of health care in India, particularly in rural areas, such initiatives are currently totally unexplored. However, the government's 2006 draft National Pharmaceuticals Policy proposes the introduction of PPPs with drug manufacturers and hospitals as a way of vastly increasing the availability of medicines to treat life-threatening diseases. It notes, for example, that while an average estimate of the value of drugs to treat the country's cancer patients is $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the near non-accessibility of the medicines to a vast majority of the affected population, mainly because of the high cost of these medicines,” says the Policy, which also calls for tax and excise exemptions for anti-cancer drugs."
#> [3] " 50.1 percent of Aventis Pharma is held by European drug major Sanofi-Aventis and, in early April 2006, it was reported that UB Holdings had sold its 10 percent holding in the firm to Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective, anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1 million, with domestic sales up 9.1 percent at $129.8 million and exports increasing 12 percent to $51.2 million. Sales were led by 83 percent annual growth for the diabetes treatment Lantus (insulin glargine), followed by the rabies vaccine Rabipur (+22 percent), the diabetes drug Amaryl (glimepiride) and epilepsy treatment Frisium (clobazam), both up 18 percent, the angiotensin-coverting enzyme inhibitor Cardace (ramipril +15 percent), Clexane (enoxaparin), an anticoagulant, growing 14 percent and Targocid (teicoplanin), an antibiotic, whose sales advanced 8 percent."
Created on 2020-09-16 by the reprex package (v0.3.0)
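Since the goal is to store each matching paragraph in its own row, the matches can go straight into a data frame (a small sketch; the column name paragraph is arbitrary):
hits <- grep("cancer", txt, value = TRUE)
result <- data.frame(paragraph = hits, stringsAsFactors = FALSE)
# result can then be written to a database table, e.g. with DBI::dbWriteTable()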

extract the first number after a specific word in a text column [duplicate]

I have some text data and I want to extract from it the first number after the word "expects earnings of". What I currently have is the following:
x <- d %>%
  mutate(
    expectsEarningsOf = str_match_all(newCol, "expects earnings of (.*?) cents")
  )
This extracts the text along with the number between "expects earnings of" and "cents". I now just want to extract the first number after "expects earnings of". I thought about something like:
x <- d %>%
  mutate(
    expectsEarningsOf = str_match_all(newCol, "expects earnings of (.*?) anyStringCharacter")
  )
where anyStringCharacter is any non-numeric character.
Data:
d <- structure(list(grp = c(2635L, 1276L, 10799L, 10882L, 6307L, 7622L,
2448L, 6467L, 3224L, 2064L, 9232L, 5039L, 2888L, 5977L, 3565L
), newCol = c("For 2008, True Religion expects earnings of $1.48 to $1.52 a share and net sales of $210 million to $215 million. The company expects to incur additional marketing expenses of about $1.7 million. ",
"But Hospira also said it now expects net sales on a GAAP basis to grow at a rate of 1% to 2% this year, reduced from earlier expectations by lower-than-expected international sales and purchasing delays in the medication-management business. After the second quarter, the company had projected growth in a range of 3% to 5%. ",
"14 Nov 2013 16:04 EDT *Thermogenesis Sees Net Savings About $1.5 Million From Reorganization",
" The Company announced that net sales for this nine week period increased by 25.4% to $185.3 million while comparable store sales for this period decreased by 0.5%. Based on this quarter-to-date performance, the Company now expects net sales for the fourth quarter of fiscal 2013 to be in the range of $208 million to $210 million, comparable store sales to be in the range of -1.5% to -0.5% and GAAP net income to be in the range of $23.3 million to $24.3 million, with a GAAP diluted income per common share range of $0.43 to $0.45 on approximately 54.0 million estimated weighted average shares outstanding. Excluding $0.9 million, or $0.02 per adjusted diluted share in tax-effected expenses related to the founders' transaction(1) , adjusted net income is expected to be approximately $24.2 million to $25.2 million, or $0.44 to $0.46 per diluted share based on estimated adjusted diluted weighted average shares outstanding of approximately 54.6 million., 9 Jan 2014 16:45 EDT *Five Below, Inc. Updates 4Q Fiscal 2013 Guidance Based On Qtr-To-Date Results",
"", "1323 GMT Raiffeisen Centrobank calls Verbund's (VER.VI) recent guidance increase for 2014 a \"mixed bag,\" raising its target price to EUR15.60 from EUR14.30. The bank retains its hold rating as positive effects are mostly due to one-offs, although the utility's sustainable cost savings were a positive surprise. \"The power price environment is still bleak following a weakish outlook for Central European economies, coal prices falling further and only lacklustre hopes for a quick fix of the European energy and climate policy,\" Raiffeisen adds. Verbund's shares trade up 0.6% at EUR15.34. (Nicole.lundeen#wsj.com; #nicole_lundeen) ",
"As a result of its third quarter results and current fourth quarter outlook, the Company has updated its guidance for fiscal 2007. The Company now expects net sales to range from $2.68 billion to $2.7 billion, which compares to prior expectations of $2.7 billion to $2.75 billion. Same-store sales for the year are expected to increase approximately 2.5% to 3% compared to previous expectations of an increase of approximately 3.0% to 4.5%. The Company now expects full year net income to range from $2.37 to $2.43 per diluted share, which compares to its prior guidance of $2.49 to $2.56 per diluted share. ",
" Sempra Energy (SRE) sees earnings next year growing 15% from this year's estimate, putting 2010 expectations above Wall Street's, as the parent of San Diego Gas & Electric anticipates much lower capital spending for the next five years.",
"Outlook for 2008: Midpoint for EPS guidance increased, For the full year 2008, the company now expects results from continuing operations as follows: earnings per diluted share of between $3.10 and $3.20, compared to the previous range of $3.00 to $3.20; revenue growth of approximately 9%, and operating income to approach 17% of revenues. Over the same period, the company expects cash from operations to approximate $900 million and capital expenditures of between $240 million and $260 million. These estimates exclude potential special charges.",
"California Pizza Kitchen expects second-quarter earnings of 34 cents to 36 cents a share. Wall Street expects earnings of 36 cents a share. ",
" -- Q1 2013 gross margin within guidance, sales ahead of guidance , \"We achieved first quarter sales ahead of and gross margin in line with our guidance, and reiterate our expectation for a sales acceleration during the year, with a second quarter markedly stronger than the first quarter and a large second half, leading to expected 2013 full year net sales at a similar level to that of 2012. The underlying assumptions are unchanged, with foundry and logic preparing for very lithography-intensive 14-20 nm technology nodes to be used for next generation mobile end-products; while lithography investments in memory are still muted, memory chip price recovery and discussions on scanner shipment capability are signs of potential upside for second half deliveries. EUV technology industrialization continues to make steady progress on the trajectory set with the introduction of the improved source concept last year: firstly, the EUV light sources have now been demonstrated at 55 Watts with adequate dose control; secondly, the scanners themselves have demonstrated production-worthy, 10 nm node compatible imaging and overlay specifications. We therefore confirm our expectation of the ramp of EUV-enabled semiconductor production in 2015, supported by our NXE:3300B scanners, two of which are being prepared for shipment and installation in Q2 and Q3,\" said Eric Meurice, President and Chief Executive Officer of ASML., -- For the second quarter of 2013, ASML expects net sales of about EUR 1.1 ",
"In the first quarter, Covanceexpects earnings of 60 cents a share on a modest sequential increase in net revenues. Analysts predicted income of 66 cents share on $534 million in revenue, which is nearly flat with the latest quarter's revenue.",
"The company said Monday it expects to report revenue of about $875 million for 2007, up sharply from $196 million in 2006, mostly because of new military contracts. However, it expects net income to remain nearly the same at $16.6 million. ",
"For the fourth quarter, the company sees earnings of $1.13 to $1.16 a share. ",
"Chip maker now expects earnings from continuing operations of 15c-17c a share, excluding restructuring charges, and a revenue decline of 25% to 30% sequentially, because of weak demand. Shares fall 6% late., Chip maker now expects earnings from continuing operations of 15c-17c a share, excluding restructuring charges, and a revenue decline of 25% to 30% sequentially, because of weak demand. Shares fall 6% late."
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-15L))
The first number after "expects earnings of":
library(stringr)
str_extract_all(d$newCol, "(?<=expects earnings of )\\d+")
This solution uses a positive lookbehind, (?<=expects earnings of ), which matches \\d+ only when it is immediately preceded by expects earnings of (including the trailing whitespace).
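Note that \\d+ matches plain digits only; in rows such as "expects earnings of $1.48 to $1.52 a share" the figure starts with a dollar sign and contains a decimal point, so it would be missed. A slightly wider pattern (a sketch, keeping the same lookbehind) handles those cases:
library(stringr)
# Allow an optional leading "$" and an optional decimal part after the digits
str_extract_all(d$newCol, "(?<=expects earnings of )\\$?\\d+(\\.\\d+)?")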

How to have all elements of a text file processed in R

I have a single text file, NPFile, that contains 100 different newspaper articles and is 3523 lines long. I am trying to pick out and parse different data fields for each article for text processing. These fields are: Full text:, Publication date:, Publication title:, etc.
I am using grep to pick out the different lines that contain the data fields I want. Although I can get the line numbers (start and end positions of the fields), I am getting an error when I try to use the line numbers to extract the actual text and put it into a vector:
#Find full text of article, clean and store in a variable
findft<-grep ('Full text:', NPFile, ignore.case=TRUE)
endft<-grep ('Publication date:', NPFile)
ftfield<-(NPFile[findft:endft])
The last line, ftfield <- (NPFile[findft:endft]), is giving this warning message:
1: In findft:endft :
numerical expression has 100 elements: only the first used
The starting positions findft and ending positions endft each contain 100 elements, but as the warning indicates, only the first pair is used, so ftfield contains just the first extract (which is 11 lines long). I was wrongly assuming that the respective lines for each of the 100 instances of the full-text field would be extracted and stored in ftfield - but obviously I have not coded this correctly. Any help would be appreciated.
Example of Data (These are the fields and data associated with one of the 100 in the text file):
Waiting for the 500-year flood; Red River rampage: Severe weather events, new records are more frequent than expected.
Full text: AS THE RED River raged over makeshift dikes futilely erected against its wrath in North Dakota, drowning cities beneath a column of water 26 feet above flood level, meteorologists were hard pressed to describe its magnitude in human chronology.
A 500-year flood, some call it, a catastrophic weather event that would have occurred only once since Christopher Columbus arrived on the shores of the New World. Whether it could be termed a 700-year flood or a 300-year flood is open to question.
The flood's size and power are unprecedented. While the Red River has ravaged the upper Midwest before, the height of the flood crest in Fargo and Grand Forks has been almost incomprehensible.
But climatological records are being broken more rapidly than ever. A 100-year-storm may as likely repeat within a few years as waiting another century. It is simply a way of classifying severity, not the frequency. "There isn't really a hundred-year event anymore," states climatologist Tom Karl of the National Oceanic and Atmospheric Administration.
Reliable, consistent weather records in the U.S. go back only 150 years or so. Human development has altered the Earth's surface and atmosphere, promoting greater weather changes and effects than an untouched environment would generate by itself.
What might be a 500-year event in the Chesapeake Bay is uncertain. Last year was the record for freshwater gushing into the bay. The January 1996 torrent of melted snowfall into the estuary recorded a daily average that exceeded the flow during Tropical Storm Agnes in 1972, a benchmark for 100-year meteorological events in these parts. But, according to the U.S. Geological Survey, the impact on the bay's ecosystem was not as damaging as in 1972.
Sea level in the Bay has risen nearly a foot in the past century, three times the rate of the past 5,000 years, which University of Maryland scientist Stephen Leatherman ties to global climate warming. Estuarine islands and upland shoreline are eroding at an accelerated pace.
The topography of the bay watershed is, of course, different from that of the Red River. It's not just flow rates and rainfall, but how the water is directed and where it can escape without intruding too far onto dry land. We can only hope that another 500 years really passes before the Chesapeake region is so tested.
Pub Date: 4/22/97
Publication date: Apr 22, 1997
Publication title: The Sun; Baltimore, Md.
Title: Waiting for the 500-year flood; Red River rampage: Severe weather events, new records are more frequent than expected.:   [FINAL Edition ]
From this data example above, ftfield has 11 lines when I examined it:
[1] "Full text: AS THE RED River raged over makeshift dikes futilely erected against its wrath in North Dakota, drowning cities beneath a column of water 26 feet above flood level, meteorologists were hard pressed to describe its magnitude in human chronology."
[2] "A 500-year flood, some call it, a catastrophic weather event that would have occurred only once since Christopher Columbus arrived on the shores of the New World. Whether it could be termed a 700-year flood or a 300-year flood is open to question."
[3] "The flood's size and power are unprecedented. While the Red River has ravaged the upper Midwest before, the height of the flood crest in Fargo and Grand Forks has been almost incomprehensible."
[4] "But climatological records are being broken more rapidly than ever. A 100-year-storm may as likely repeat within a few years as waiting another century. It is simply a way of classifying severity, not the frequency. \"There isn't really a hundred-year event anymore,\" states climatologist Tom Karl of the National Oceanic and Atmospheric Administration."
[5] "Reliable, consistent weather records in the U.S. go back only 150 years or so. Human development has altered the Earth's surface and atmosphere, promoting greater weather changes and effects than an untouched environment would generate by itself."
[6] "What might be a 500-year event in the Chesapeake Bay is uncertain. Last year was the record for freshwater gushing into the bay. The January 1996 torrent of melted snowfall into the estuary recorded a daily average that exceeded the flow during Tropical Storm Agnes in 1972, a benchmark for 100-year meteorological events in these parts. But, according to the U.S. Geological Survey, the impact on the bay's ecosystem was not as damaging as in 1972."
[7] "Sea level in the Bay has risen nearly a foot in the past century, three times the rate of the past 5,000 years, which University of Maryland scientist Stephen Leatherman ties to global climate warming. Estuarine islands and upland shoreline are eroding at an accelerated pace."
[8] "The topography of the bay watershed is, of course, different from that of the Red River. It's not just flow rates and rainfall, but how the water is directed and where it can escape without intruding too far onto dry land. We can only hope that another 500 years really passes before the Chesapeake region is so tested."
[9] "Pub Date: 4/22/97"
[10] ""
[11] "Publication date: Apr 22, 1997"
And, lastly, findft[1] corresponds with endft[1] and so on until findft[100] and endft[100].
I'll assume that findft contains several indexes, as does endft. I'm also assuming that both have the same length, that they are paired by position (e.g. findft[5] corresponds to endft[5]), and that you want all the NPFile elements between each pair of indexes.
If this is so, try:
ftfield <- lapply(seq_along(findft), function(x) NPFile[findft[x]:endft[x]])
This will return a list. I can't guarantee that this will work because there is no data example to work with.
We can do this with Map: get the sequence of values for each corresponding element of 'findft' and 'endft', then subset 'NPFile' based on that index.
Map(function(x, y) NPFile[x:y], findft, endft)
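If each article's full text should end up as a single string (one element per article) rather than a list of line vectors, the pieces can be collapsed afterwards (a small sketch building on the Map result above):
ft_list <- Map(function(x, y) NPFile[x:y], findft, endft)
# Collapse each article's lines into one string per article
ftfield <- vapply(ft_list, paste, character(1), collapse = " ")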

readPDF (tm package) in R

I tried to read an online PDF document in R. I used the readPDF function. My script goes like this:
safex <- readPDF(PdftotextOptions='-layout')(elem=list(uri='C:/Users/FCG/Desktop/NoteF7000.pdf'),language='en',id='id1')
R showed a message that the command has status 309. I tried different pdftotext options; however, I get the same message, and the text file created has no content.
Can anyone read this PDF?
readPDF has bugs and probably isn't worth bothering with (check out this well-documented struggle with it).
Assuming that...
- you've got xpdf installed (see here for details)
- your PATHs are all in order (see here for details of how to do that) and you've restarted your computer.
Then you might be better off avoiding readPDF and instead using this workaround:
system(paste('"C:/Program Files/xpdf/pdftotext.exe"',
             '"C:/Users/FCG/Desktop/NoteF7000.pdf"'), wait = FALSE)
And then read the text file into R like so...
require(tm)
mycorpus <- Corpus(URISource("C:/Users/FCG/Desktop/NoteF7001.txt"))
And have a look to confirm that it went well:
inspect(mycorpus)
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Market Notice
Number: Date F7001 08 May 2013
New IDX SSF (EWJG) The following new IDX SSF contract will be added to the list and will be available for trade today.
Summary Contract Specifications Contract Code Underlying Instrument Bloomberg Code ISIN Code EWJG EWJG IShares MSCI Japan Index Fund (US) EWJ US EQUITY US4642868487 1 (R1 per point)
Contract Size / Nominal
Expiry Dates & Times
10am New York Time; 14 Jun 2013 / 16 Sep 2013
Underlying Currency Quotations Minimum Price Movement (ZAR) Underlying Reference Price
USD/ZAR Bloomberg Code (USDZAR Currency) Price per underlying share to two decimals. R0.01 (0.01 in the share price)
4pm underlying spot level as captured by the JSE.
Currency Reference Price
The same method as the one utilized for the expiry of standard currency futures on standard quarterly SAFEX expiry dates.
JSE Limited Registration Number: 2005/022939/06 One Exchange Square, Gwen Lane, Sandown, South Africa. Private Bag X991174, Sandton, 2146, South Africa. Telephone: +27 11 520 7000, Facsimile: +27 11 520 8584, www.jse.co.za
Executive Director: NF Newton-King (CEO), A Takoordeen (CFO) Non-Executive Directors: HJ Borkum (Chairman), AD Botha, MR Johnston, DM Lawrence, A Mazwai, Dr. MA Matooane , NP Mnxasana, NS Nematswerani, N Nyembezi-Heita, N Payne Alternate Directors: JH Burke, LV Parsons
Member of the World Federation of Exchanges
Company Secretary: GC Clarke
Settlement Method
Cash Settled
-
Clearing House Fees -
On-screen IDX Futures Trading: o 1 BP for Taker (Aggressor) o Zero Booking Fees for Maker (Passive) o No Cap o Floor of 0.01 Reported IDX Futures Trades o 1.75 BP for both buyer and seller o No Cap o Floor of 0.01
Initial Margin Class Spread Margin V.S.R. Expiry Date
R 10.00 R 5.00 3.5 14/06/2013, 16/09/2013
The above instrument has been designated as "Foreign" by the South African Reserve Bank
Should you have any queries regarding IDX Single Stock Futures, please contact the IDX team on 011 520-7399 or idx#jse.co.za
Graham Smale Director: Bonds and Financial Derivatives Tel: +27 11 520 7831 Fax:+27 11 520 8831 E-mail: grahams#jse.co.za
Distributed by the Company Secretariat +27 11 520 7346
Page 2 of 2
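If a more recent toolchain is acceptable, the pdftools package wraps pdftotext directly and avoids the PATH setup entirely (a sketch, not part of the original answer; the file path is the one from the question):
library(pdftools)
# pdf_text() returns a character vector with one element per page
txt <- pdf_text("C:/Users/FCG/Desktop/NoteF7000.pdf")
cat(txt[1])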
