Extract relevant text from a .txt file in R

I am still at a basic beginner level with R. I am currently working on some natural language processing and I use the ProQuest Newsstand database. Even though the database lets you download txt files, I don't need everything they provide. The files you can download there look like this:
###############################################################################
____________________________________________________________
Report Information from ProQuest 16 July 2016 09:58
____________________________________________________________
____________________________________________________________
Inhaltsverzeichnis
1. Savills cracks Granite deal to establish US presence ; COMMERCIAL PROPERTY
____________________________________________________________
Dokument 1 von 1
Savills cracks Granite deal to establish US presence ; COMMERCIAL PROPERTY
http:...
Kurzfassung: Savills said that as part of its plans to build...
Links: ...
Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...
Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910
Titel: Savills cracks Granite deal to establish US presence; COMMERCIAL PROPERTY:   [FIRST Edition]
Autor: Steve Pain Commercial Property Editor
Titel der Publikation: Birmingham Post
Seiten: 30
Seitenanzahl: 0
Erscheinungsjahr: 2007
Publikationsdatum: Aug 2, 2007
Jahr: 2007
Bereich: Business
Herausgeber: Mirror Regional Newspapers
Verlagsort: Birmingham (UK)
Publikationsland: United Kingdom
Publikationsthema: General Interest Periodicals--Great Britain
Quellentyp: Newspapers
Publikationssprache: English
Dokumententyp: NEWSPAPER
ProQuest-Dokument-ID: 324215031
Dokument-URL: ...
Copyright: (Copyright 2007 Birmingham Post and Mail Ltd.)
Zuletzt aktualisiert: 2010-06-19
Datenbank: UK Newsstand
____________________________________________________________
Kontaktieren Sie uns unter: http... Copyright © 2016 ProQuest LLC. Alle Rechte vorbehalten. Allgemeine Geschäftsbedingungen: ...
###############################################################################
What I need is a way to extract only the full text ("Volltext") into a CSV file. When I download hundreds of articles in a single file, copying and pasting them manually is quite difficult, but the file is quite structured. The length of the full text varies; nevertheless, one could use the next header after the full text as a stop sign (I guess).
Is there any way to do this?
I would really appreciate some help.
Kind regards,
Steffen

Let's say you have all the publication information in a single text file. Make a backup copy of your file first. Using Notepad++ and regex, you would go through the following steps:
Ctrl+F
Choose the Mark tab.
Search mode: Regular expression
Find what: ^Volltext:\s
Alt+M to check Bookmark line (if unchecked only)
Click on Mark All
From the main menu go to: Search > Bookmark > Remove Unmarked Lines
Then, in a final pass, strip the Volltext: prefix itself:
Ctrl+H
Search mode: Regular expression
Find what: ^Volltext:\s (choose from dropdown)
Replace with: NOTHING (clear text field)
Click on Replace All
Done ...
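
If you would rather stay in R, the same keep-and-strip logic takes only a few lines. A minimal sketch, assuming each full text sits on a single line beginning with "Volltext: " (the file names are placeholders):
# Keep only the lines that start with "Volltext:" (the Mark / Remove Unmarked Lines steps) ...
lines <- readLines("proquest_export.txt", encoding = "UTF-8")  # placeholder path
volltext <- grep("^Volltext:\\s", lines, value = TRUE)
# ... then strip the prefix (the Replace All step)
volltext <- sub("^Volltext:\\s*", "", volltext)
writeLines(volltext, "volltext_only.txt")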

Try this out:
con <- file("./R/sample text.txt")
content <- paste(readLines(con), collapse = "\n")
close(con)
# Collapse double line breaks into single ones
content <- gsub(pattern = "\\n\\n", replacement = "\n", x = content)
# Keep the part from "Volltext:" up to the next separator line (10+ underscores)
content.filtered <- sub(pattern = "(.*)(Volltext:.*?)(_{10,}.*)",
                        replacement = "\\2", x = content)
Results:
> cat(content.filtered)
Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...
Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910
Titel: Savills cracks Granite deal to establish US presence; COMMERCIAL PROPERTY: [FIRST Edition]
Autor: Steve Pain Commercial Property Editor
Titel der Publikation: Birmingham Post
Seiten: 30
Seitenanzahl: 0
Erscheinungsjahr: 2007
Publikationsdatum: Aug 2, 2007
Jahr: 2007
Bereich: Business
Herausgeber: Mirror Regional Newspapers
Verlagsort: Birmingham (UK)
Publikationsland: United Kingdom
Publikationsthema: General Interest Periodicals--Great Britain
Quellentyp: Newspapers
Publikationssprache: English
Dokumententyp: NEWSPAPER
ProQuest-Dokument-ID: 324215031
Dokument-URL: ...
Copyright: (Copyright 2007 Birmingham Post and Mail Ltd.)
Zuletzt aktualisiert: 2010-06-19
Datenbank: UK Newsstand
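
Since you mentioned hundreds of articles per file and a CSV target, the single-article sub() above can be generalised with gregexpr(). A sketch, assuming every full text starts with "Volltext:" and, as in your sample, is immediately followed by the "Unternehmen/Organisation:" field (widen the lookahead if some exports use a different next field):
content <- paste(readLines("./R/sample text.txt", encoding = "UTF-8"), collapse = "\n")
# Grab everything from each "Volltext:" up to the next metadata field
texts <- regmatches(content,
                    gregexpr("(?s)Volltext:\\s*.*?(?=\\nUnternehmen/Organisation:)",
                             content, perl = TRUE))[[1]]
texts <- sub("^Volltext:\\s*", "", texts)
write.csv(data.frame(volltext = texts), "volltext.csv", row.names = FALSE)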

Related

How to extract data from HTML tags using python

I am trying to extract data from resumes. I'm using pypandoc to convert .docx files to HTML. Below is the code I used.
import pypandoc
output = pypandoc.convert_file('E:/cvparser/backupresumes/xyz.docx', 'html', outputfile="E:/cvparser/abc.html")
assert output == ""
print(output)
Here is the html file:
<p>PROFILE SUMMARY</p>
<ul>
<li><p>4 years of experience working in corporate environment as a full stack developer. Strong technical skills in complex website development including web based application.</p></li>
<li><p>ERP application development & enhancement, service delivery and client relationship management in education and industrial domain.</p></li>
</ul>
<p>EDUCATION</p>
<p>MCA (Master of Computer Applications) from CMR Institute of Management Studies – Bangalore University with 78%</p>
<p>BCA (Bachelor of Computer Applications) from Shri SVK College of Business and Management studies - Gulbarga University with 74%.</p>
<p>TECHNICAL SKILLS</p>
<p>Web Technologies: HTML/HTML5, CSS, JavaScript, Ajax, JSON, Apache, Bootstrap.</p>
<p>WORK HISTORY</p>
<ul>
<li><p>Leviosa Consulting Pvt Ltd from Feb 2015 to till date as a sr. Software Developer.</p></li>
<li><p>DRDO – Defence Research and Development Organization from Nov 2014 to Feb 2015 as a contract engineer.</p></li>
</ul>
<p>PROJECTS</p>
<p><strong>I1ERP – Manufacturing Industry</strong></p>
<p>Technologies Used: PHP, MySQL, HTML, CSS, Ajax, Bootstrap, Angular 6.</p>
<p>Duration: 1 Year.</p>
<ul>
<li><p>I1ERP is a fully custom designed application software which itself builds another application without writing code.</p></li>
<li><p>Anyone having knowledge of computer can use this app and build application based on the user requirements.</p></li>
<li><p>This automate and streamline business processes with greater adoptability.</p></li>
<li><p>I1ERP integrates all facets of an operation including product planning, manufacturing, sales, invoice, marketing and Human Resource.</p></li>
</ul>
<p>This software has modules like Document Mgmt., Reminder System, Checklist System, Work Tracking System and Password Mgmt.</p>
<p>PERSONAL DETAILS</p>
<p>Date of Birth: 5<sup>th</sup> Feb 1990</p>
<p>Marital Status: Unmarried</p>
<p>Nationality: Indian</p>
<p>Languages Known: English, Kannada, Telugu and Hindi.</p>
Can someone explain how to extract Work History from this?
Here is one possible solution using BeautifulSoup. The variable data contains the HTML text from the question:
from bs4 import BeautifulSoup

# Note: this selector needs beautifulsoup4 4.7+ (soupsieve backend);
# older versions only implement nth-of-type
soup = BeautifulSoup(data, 'html.parser')
# Select every sibling after the "WORK HISTORY" paragraph, stopping before
# the next <p> tag (the next section heading)
for tag in soup.select('p:contains("WORK HISTORY") ~ *:not(p:contains("WORK HISTORY") ~ p, p:contains("WORK HISTORY") ~ p ~ *)'):
    print(tag.get_text(strip=True, separator='\n'))
Prints:
Leviosa Consulting Pvt Ltd from Feb 2015 to till date as a sr. Software Developer.
DRDO – Defence Research and Development Organization from Nov 2014 to Feb 2015 as a contract engineer.
I tried the following:
import pypandoc
from bs4 import BeautifulSoup

output = pypandoc.convert_file('E:/cvparser/backupresumes/Bapuray.docx', 'html', outputfile="E:/cvparser/Bap.html")
assert output == ""

with open('E:/cvparser/Bap.html') as report:
    raw = report.readlines()
html = "".join(raw)  # renamed from str to avoid shadowing the built-in
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('p:contains("WORK HISTORY") ~ *:not(p:contains("WORK HISTORY") ~ p, p:contains("WORK HISTORY") ~ p ~ *)'):
    print(tag.get_text(strip=True, separator='\n'))
I got the below error:
NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type

Extracting full article text via the newsanchor package [in R]

I am using the newsanchor package in R to try to extract entire article contents via NewsAPI. So far I have done the following:
require(newsanchor)
results <- get_everything(query = "Trump +Trade", language = "en")
test <- results$results_df
This gives me a dataframe full of info on (at most) 100 articles. However, these do not contain the entire article text. Rather, they contain something like the following:
[1] "Tensions between China and the U.S. ratcheted up several notches over the weekend as Washington sent a warship into the disputed waters of the South China Sea. Meanwhile, Google dealt Huaweis smartphone business a crippling blow and an escalating trade war co… [+5173 chars]"
Is there a way to extract the remaining 5173 chars? I have tried to read the documentation but I am not really sure.
I don't think that is possible, at least with the free plan. If you go through the documentation at https://newsapi.org/docs/endpoints/everything, in the Response object section it says:
content - string
The unformatted content of the article, where available. This is truncated to 260 chars for Developer plan users.
So all the content is restricted to only 260 characters. However, test$url holds the link to the source article, which you can use to scrape the entire content; but since the articles are aggregated from many different sources, I don't think there is one automated way to do this.
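That said, a rough sketch of the scraping idea with rvest, assuming the article body sits in <p> tags (every site differs, so the selector will need per-site tuning; get_fulltext is just a hypothetical helper name):
library(rvest)
# Fetch one URL and collapse its <p> tags into a single string
get_fulltext <- function(url) {
  page <- read_html(url)
  paste(html_text2(html_elements(page, "p")), collapse = "\n")
}
test$fulltext <- vapply(test$url, get_fulltext, character(1))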

R extract percentage of entries out of textfile using readlines

Hi, I have a very large txt file (character) from which I want to extract 10% of the entries and save them to another txt file.
con1 <- file("ABC.txt", "rb") # 2.36 million records
dfc1 <- readLines(con1, ???, skipNul = TRUE)
Instead of ???, I want something like <10% of all data>.
So if my ABC.txt were like
" BBC Worldwide is a principle commercial arm and a wholly owned subsidiary of the British Broadcasting Corporation (BBC). The business exists to support the BBC public service mission and to maximise profits on its behalf..."
my new file should contain only 10% of the words (chosen at random), like:
" Worldwide business behalf..."
Is there a way to do that in R?
Thank you
If you read in the text file, you can then use the stringr package to get a 10% random sample of the words with the following code:
library(stringr)
text <- c("BBC Worldwide is a principle commercial arm and a wholly owned subsidiary of the British Broadcasting Corporation (BBC). The business exists to support the BBC public service mission and to maximise profits on its behalf...")
set.seed(9999)  # for reproducibility
# Number of words = number of spaces + 1; draw 10% of the word positions
# (the original round(0.1*str_count(text," ")+1) rounded after adding 1)
n_words <- str_count(text, " ") + 1
selection <- sample.int(n_words, round(0.1 * n_words))
sampled <- word(text, selection)
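To apply this to the actual file and save the result, a minimal sketch (assuming ABC.txt fits in memory; file names are placeholders):
library(stringr)
set.seed(9999)
lines <- readLines("ABC.txt", skipNul = TRUE)
text <- paste(lines, collapse = " ")  # treat the whole file as one long string
n_words <- str_count(text, " ") + 1
selection <- sort(sample.int(n_words, round(0.1 * n_words)))
writeLines(paste(word(text, selection), collapse = " "), "ABC_10pct.txt")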

Serving a vCard (.vcf) file through Iron Router

I'm trying to wrap my head around how I can deliver a file through Iron Router. Here is what I am trying to accomplish:
1) User opens URL like http://website.com/vcard/:_id
2) Meteor generates vCard file
BEGIN:VCARD
VERSION:3.0
N:Gump;Forrest;;Mr.
FN:Forrest Gump
ORG:Bubba Gump Shrimp Co.
TITLE:Shrimp Man
PHOTO;VALUE=URL;TYPE=GIF:http://www.example.com/dir_photos/my_photo.gif
TEL;TYPE=WORK,VOICE:(111) 555-1212
TEL;TYPE=HOME,VOICE:(404) 555-1212
ADR;TYPE=WORK:;;100 Waters Edge;Baytown;LA;30314;United States of America
LABEL;TYPE=WORK:100 Waters Edge\nBaytown\, LA 30314\nUnited States of Ameri
ca
ADR;TYPE=HOME:;;42 Plantation St.;Baytown;LA;30314;United States of America
LABEL;TYPE=HOME:42 Plantation St.\nBaytown\, LA 30314\nUnited States of Ame
rica
EMAIL;TYPE=PREF,INTERNET:forrestgump@example.com
REV:2008-04-24T19:52:43Z
END:VCARD
3) User gets .vcf file and it runs on their phone, Outlook, etc.
Thanks!
It has little to do with Iron Router. You need something that can return a simple text file. Here is a demo which kind of does that:
http://meteorpad.com/pad/TbjQfAnmTAFQcyZ5a/Leaderboard

RSS feed for gas prices and how to interpret the feed

I am trying to add an RSS feed of gas prices based on location to my application.
I googled for RSS feeds for gas prices and bumped into Motortrend's gas price feed:
http://www.motortrend.com/widgetrss/gas-
The feed seems to be fine, but the price values seem to be depicted in letters, as below:
Chevron 3921 Irvine Blvd, Irvine, CA 92602 (0.0 miles)
Monday, May 10, 2010 9:16 AM
Regular: ZEIECHK Plus: ZEHGIHC Premium: ZEGJEGE Diesel: N/A
How do I interpret these values to come up with the gas price? Or is it internal to Motortrend and cannot be used elsewhere?
View the source of the Vista Motortrend Sidebar gadget by downloading the file and renaming its extension to .zip. Then unzip the source and look at the /js/gas.js file. You will find a JS function called parsePrice(). It is basically a character conversion to find the Unicode value, plus some simple math.
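Just to illustrate the mechanism in R (the actual mapping and arithmetic live in gas.js; nothing here reproduces them):
# Turn each letter of the encoded price into its Unicode code point.
# gas.js then applies its own arithmetic to values like these to recover digits.
utf8ToInt("ZEIECHK")
#> [1] 90 69 73 69 67 72 75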
