I want to record the job posting information from this search. Is anyone aware of an API, or can anyone confirm that it's possible to scrape it with Python's Beautiful Soup? (I'm familiar with scraping; I just can't see how to get at this website's data.)
Disclosure: I work at SerpApi.
You can use the google-search-results package to get data from Google Jobs listings. Check a demo at Repl.it.
from serpapi import GoogleSearch

# Replace API_KEY with your own SerpApi key.
params = {
    "engine": "google_jobs",
    "q": "sustainability jobs in mi",
    "google_domain": "google.com",
    "api_key": "API_KEY"
}

client = GoogleSearch(params)
data = client.get_dict()

# Print the main fields of every job posting.
print("Job results")
for job_result in data['jobs_results']:
    print(f"""Title: {job_result['title']}
Company name: {job_result['company_name']}
Description: {job_result['description']}
""")

# Print the available search filters and their options.
print("Filters")
for chip in data['chips']:
    print(f"Type: {chip['type']}\n")
    print("Options")
    for option in chip['options']:
        print(option['text'])
Response
{
"jobs_results":[
{
"title":"Sustainability Analyst",
"company_name":"Amcor",
"location":"Ann Arbor, MI",
"via":"via LinkedIn",
"description":"Amcor Limited Job Posting\n\nRole: Sustainability Analyst\n\nLocation: TBD, ideally in the US (Ann Arbor, MI)\n\nAbout Amcor\n\nAmcor (ASX: AMC;\n\nAmcor is proud of its recent pledge to design all of our packaging to be recyclable or reusable by 2025. The job holder will play a very important and exciting role in Amcor’s journey to deliver this important commitment.\n\nPosition Overview\n\nRead more about Amcor’s sustainability commitment:\n\nThe Sustainability function plays a key role in positioning Amcor as THE leading packaging company for the environment delivering on Amcor’s sustainability strategy, the 2025 pledge and as a supplier of choice for responsible packaging.\n\nThe Sustainability Analyst is responsible for analyzing, reporting, and coordinating selected global Sustainability activities with direction from the VP Sustainability.\n\nEssential Responsibilities And Duties\n• Track legislative activity, analyze for risk and opportunity, help to prioritize actions\n• Assist with drafting... 
positions, coordinate Amcor activity and governance around advocacy (mostly in industry group participation)\n• Assists with internal reporting and communications, including preparing decks for internal meetings\n• Partnership administration, tracking projects and payments, and liaising with corporate finance on dept budget\n• Manage compliance statements, including anti-slavery statements, conflict minerals etc.\n• Coordinates the International Costal Cleanup, as needed with other partners\n• Other similar duties as required to support the corporate sustainability program\n\nQualifications\n• Education: Master's Degree or equivalent in related field preferred\n• Three to five years of experience\n• Strong analytical skills, including ability to interpret and graphically display environmental performance data\n• Excellent written and verbal communications skills\n• Excellent working knowledge of Microsoft Office\n• Demonstrated professional work characteristics including high initiative, dependability, and ability to manage confidential information\n• Must be well organized and comfortable interfacing with all levels of management\nAmcor Leadership Framework Competencies\n• Drive for Results\n• Influencing Others\n• Customer Focus\n• Learning on the Fly\n• Interpersonal Savvy\n• Organizational Awareness\n• Priority Setting\n• Organizing\n• Functional / Technical Skills\n• Strong Computer Skills\n\nRelationships\n• Amcor Leadership\n• Direct Reports\n• External Vendors\n• Government agencies\n• Global partners/ Nonprofit organizations\n• Industry organizations\nExpected Travel: 10% Travel\n\nThe information contained herein is not intended to be an all-inclusive list of the duties and responsibilities of the job, nor are they intended to be an all-inclusive list of the skills and abilities required to do the job.\n\n#North America",
"extensions":[
"Over 1 month ago",
"Full-time"
]
},
{
"title":"Environmental Jobs in Michigan,USA",
"company_name":"freelancejobopenings.com",
"location":"Michigan",
"via":"via Freelance Job Openings",
"description":"Environmental Jobs in Michigan,USA\n\nSummer Camp Instructor\n\nenvironmental learning center at barr lake state park with a satellite office in fort collins and fieldwork outposts in environmental science, leadership, and or outdoor adventure programs for diverse audiences in formal and non formal outdoor and classroom environmental studies, biological sciences, natural resource management, or related field, with a focus in ornithology.\n\n strong summer, birding, camp, education, colorado, outdoors, teaching\n\nwebsite: barefoot student summer camp\n\nSITE LEAD\n\nenvironmental changes, and sudden work schedule changes.\n• tech savvy: frito lay is an industry leader site: fritolay the site lead is accountable for ensuring the building is operating at top performance to deliver the zone sops strategy and ensures a safe working environment. the role requires cross functional understanding in order to drive operations success.\n\nwe are open 24 hours a day, which means\n\nField Service ... Chromatography Spectrometry Instruments - Grand Rapids, MI\n\nenvironmental testing, and forensic toxicology looking to hire field service engineer to support lcms and gcms platforms. travel to client labs to perform calibrations, diagnose problems with equipment field service chromatography spectrometry instruments grand rapids, mi\n\nleader in liquid chromatography mass spectrometry and gas chromatography mass spectrometry, supporting clinical research, drug discovery, food and environmental testing, and forensic toxicology looking to hire field service engineer to support\n\nUTA Test Engineer\n\nenvironmental demands may be referenced in an attempt to municate the manner in which this position traditionally is performed. 
about capgemini:\n\na global leader in consulting, technology services and digital transformation, capgemini is at the forefront of innovation to address the entire breadth of clients’ opportunities in the evolving world of cloud, digital and platforms. building on its strong 50 year heritage and deep industry specific expertise, capgemini enables organizations to realize\n\nIndustrial Water/Wastewater Design Engineer\n\nenvironmental, civil, or chemical\n• 4+ years of industrial water wastewater system environmental, civil or chemical\n• water wastewater treatment design experience in variety industrial markets\n• experience with biological and physical chemical treatment design build experience\n\nwhat we offer engineering water wastewater\n\nbusiness line design and consulting services group (dcs)\n\ncountry",
"extensions":[
"13 hours ago",
"Full-time"
]
}
]
}
If you want more information, check out the SerpApi documentation.
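To actually record the postings, the parsed response can be flattened into CSV. The following is only a minimal sketch and not part of the SerpApi client: the helper name jobs_to_csv is hypothetical, the field names are the ones visible in the response above, and the sample dictionary is an abridged copy of the first result so that the snippet runs without an API key.

```python
import csv
import io


def jobs_to_csv(data):
    """Flatten the 'jobs_results' list of a response dict into CSV text."""
    fields = ["title", "company_name", "location", "via", "extensions"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for job in data.get("jobs_results", []):
        # Missing keys become empty cells instead of raising KeyError.
        row = {name: job.get(name, "") for name in fields[:-1]}
        # 'extensions' is a list, so join it into a single cell.
        row["extensions"] = "; ".join(job.get("extensions", []))
        writer.writerow(row)
    return buffer.getvalue()


# Abridged copy of the first result shown in the response above.
sample = {
    "jobs_results": [
        {
            "title": "Sustainability Analyst",
            "company_name": "Amcor",
            "location": "Ann Arbor, MI",
            "via": "via LinkedIn",
            "extensions": ["Over 1 month ago", "Full-time"],
        }
    ]
}

csv_text = jobs_to_csv(sample)
print(csv_text)
```

In real use, the dictionary returned by client.get_dict() would be passed in place of sample, and the returned text written to a file.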
Related
I am trying to remove "\r\n-" in a text which I extracted from a PDF file using readtext() from readtext package in R Studio. Below is my code in R:
library(readtext)
jd <- readtext("C:/Users/HomeUser/Documents/Sales Manager.pdf")
jd_text <- jd$text
jd_text2 <- gsub(pattern = "\r\n-?|•", replacement = " ", jd_text)
Below is the original extracted text jd_text:
"Sales Manager\r\nCFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with\r\nlarge enterprises in their Digital Transformation journey and help them and their employees thrive\r\nin the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision\r\nis to make work enjoyable. For more information, please visit www.cfb-bots.com.\r\nWe are looking for a high performing frontrunner to blaze the trail and make new connections for\r\nour growing business. As a Sales Manager, you will play a vital role in keeping the Company\r\ncompetitive by achieving our customer acquisition and revenue growth targets. You will be the key\r\nliaison in every stage of the sales process, from planning to closing the sales.\r\nIf you are passionate about technology and are motivated by a hunger to solve our clients’\r\nchallenges, read on to find out more.\r\nYou can gain:\r\n− Incentive for achieving sales targets\r\n− Exposure to the latest industry trends and technologies\r\n− Endless learning and growth opportunities\r\n− Sharpen sales planning, analytical and management skills\r\n− Flexible work-life benefits\r\nYou will do:\r\nSales Strategy\r\n- Develop ..."
I was able to remove many "\r\n-" in jd_text using gsub(). Output from jd_text2 below:
"Sales Manager CFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with large enterprises in their Digital Transformation journey and help them and their employees thrive in the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision is to make work enjoyable. For more information, please visit www.cfb-bots.com. We are looking for a high performing frontrunner to blaze the trail and make new connections for our growing business. As a Sales Manager, you will play a vital role in keeping the Company competitive by achieving our customer acquisition and revenue growth targets. You will be the key liaison in every stage of the sales process, from planning to closing the sales. If you are passionate about technology and are motivated by a hunger to solve our clients’ challenges, read on to find out more. You can gain: − Incentive for achieving sales targets − Exposure to the latest industry trends and technologies − Endless learning and growth opportunities − Sharpen sales planning, analytical and management skills − Flexible work-life benefits You will do: Sales Strategy Develop ..."
As you can see, I was able to remove "\r\n-" occurring after "Flexible work-life benefits" while "-" from those first few "\r\n-" still remained. However, when I pasted the original text extract directly from the display of jd_text in R Studio console into a new variable jd_test, applied gsub() again, I was able to accomplish my goal:
jd_test <- "Sales Manager\r\nCFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with\r\nlarge enterprises in their Digital Transformation journey and help them and their employees thrive\r\nin the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision\r\nis to make work enjoyable. For more information, please visit www.cfb-bots.com.\r\nWe are looking for a high performing frontrunner to blaze the trail and make new connections for\r\nour growing business. As a Sales Manager, you will play a vital role in keeping the Company\r\ncompetitive by achieving our customer acquisition and revenue growth targets. You will be the key\r\nliaison in every stage of the sales process, from planning to closing the sales.\r\nIf you are passionate about technology and are motivated by a hunger to solve our clients’\r\nchallenges, read on to find out more.\r\nYou can gain:\r\n− Incentive for achieving sales targets\r\n− Exposure to the latest industry trends and technologies\r\n− Endless learning and growth opportunities\r\n− Sharpen sales planning, analytical and management skills\r\n− Flexible work-life benefits\r\nYou will do:\r\nSales Strategy\r\n- Develop ..."
jd_test2 <- gsub(pattern = "\r\n-?|•", replacement = " ", jd_test)
Output from jd_test2:
Sales Manager CFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with large enterprises in their Digital Transformation journey and help them and their employees thrive in the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision is to make work enjoyable. For more information, please visit www.cfb-bots.com. We are looking for a high performing frontrunner to blaze the trail and make new connections for our growing business. As a Sales Manager, you will play a vital role in keeping the Company competitive by achieving our customer acquisition and revenue growth targets. You will be the key liaison in every stage of the sales process, from planning to closing the sales. If you are passionate about technology and are motivated by a hunger to solve our clients’ challenges, read on to find out more. You can gain: Incentive for achieving sales targets Exposure to the latest industry trends and technologies Endless learning and growth opportunities Sharpen sales planning, analytical and management skills Flexible work-life benefits You will do: Sales Strategy Develop ..."
Does anyone have any idea what the problem is and how I should go about fixing it? I have tried another function, pdf_text() from the pdftools package, but it yielded the same frustrating result. At first I thought the "-" in the first few "\r\n-" was slightly longer than in the later ones, but the direct copy-paste attempt seems to contradict this observation. Is there something "hidden" in the object that is not carried over by the copy-paste action? Any suggestions are greatly appreciated!
I found a likely answer to my question. It seems the original text extracted from the PDF document was not in an encoding that RStudio could recognise. This would explain why the first few "-"s were not removed. After I applied jd_text <- iconv(jd_text, "UTF-8") to coerce the encoding to UTF-8, my problem was solved, and I was able to remove "\r\n-" completely.
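A related thing worth checking, sketched here in Python rather than R purely for illustration: in the extract as quoted above, the first few bullets appear to be the Unicode minus sign (U+2212) while the last one is an ASCII hyphen. Assuming that is what the PDF really contains, a pattern written with "-" only matches the ASCII hyphen, and listing the dash variants explicitly removes all of them. The shortened sample string and the variable names are made up for this sketch.

```python
import re

# Shortened stand-in for the extracted text: two bullets use U+2212
# (MINUS SIGN), the last one uses the ASCII hyphen-minus.
text = (
    "You can gain:\r\n\u2212 Incentive for achieving sales targets"
    "\r\n\u2212 Flexible work-life benefits"
    "\r\nSales Strategy\r\n- Develop ..."
)

# Same pattern as in the question: only the ASCII hyphen is optional here.
ascii_only = re.sub(r"\r\n-?|\u2022", " ", text)

# Variant that also accepts U+2212, en dash (U+2013) and em dash (U+2014).
any_dash = re.sub(r"\r\n[-\u2212\u2013\u2014]?|\u2022", " ", text)

print("U+2212 left after ASCII-only pattern:", "\u2212" in ascii_only)
print("U+2212 left after extended pattern:", "\u2212" in any_dash)
```

Whether this applies depends on the actual code points in jd_text, which can be inspected in R with utf8ToInt() on a short substring.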
I am trying to extract textbook names, and other journal articles in syllabi collected from various courses using R. My basic assumption is that most of these will be in some kind of a citation format (e.g. APA, MLA, etc). While I can try to create regex-s to extract this information, I was wondering if anyone has tried to do this before, or if an R package exists that I may be able to use to extract this information from differently formatted text.
Below are two examples of the syllabi that I am working with. In Sample 1, the book name is not in a citation format, but in sample 2, it is in a citation format. Both samples have been truncated to meet stackoverflow character limits.
SAMPLE 1:
"ABC State University ARTS 3366 Intermediate Digital Photography Fall 2015 JCM 4127 T/TH 24:30 pm Lecturer: John Smith Office Hours: T/TH prior to and after class Email: johnsmith#abcstate.edu Alternate email: johnsmith#gmail.com Prerequisites: ARTS 3364 Introduction to Digital Photography Course Description & Objectives: This course is designed to expand and build on the skills and knowledge acquired in Introduction to Digital Photography. This course builds on the skills and knowledge acquired in Introduction to Digital Photography. Specifically, we will use the history, critical analysis, and production of photography books to: (1) explore the complexities of the medium in social, political, and aesthetic contexts; (2) develop more advanced and conceptually driven photography work; (3) work toward a greater understanding of how photography books function as selfcontained art, cultural, and political objects; (4) learn how to choose subject matter and continually explore, experiment, and refine our work. The final outcomes ofthe class will be the creation of an ondemand book and an accompanying folio of fine prints. We will use digital cameras, inkjet printers, Adobe Photoshop, Lightroom, and Macintosh computers in this course. Through lectures, discussions and readings, we will explore and discuss historical trends in traditional (analog) photography, as well as emerging practices in contemporary digital imaging. This will serve as a foundation to help determine the approach, subject matter, and style of the work created for class. In addition to refining these skills, students will also address the practical and theoretical roles of digital imagery. The course objective will be to focus on technical, aesthetic, and conceptual growth of a student’s endeavors in the digital medium. 
This course requires the completion of: all assignments (on time), participation in all group critiques and completion of a Twelve to Fifteen image final portfolio of prints or equivalent, and three projects throughout the semester. Requirements: Coursework: This course requires the completion of: all assignments (on time), participation in all group critiques and completion of a 1215 image final portfolio of prints or equivalent, the creation of a book printed with an on demand printing service, as well as making new photographs consistently throughout the entire semester. Suggested (not required)Books: Adobe Photoshop Lightroom 5 Book, The: The Complete Guide for Photographers By Martin Evening Published Jun 30, 2013 by Adobe Press The Photographer’s Playbook 307 Assignments and Ideas Edited by Jason Fulford and Gregory Halpern Published by Aperture On Being a Photographer: A Practical Guide by David Hurn and Bill Jay Local Stores:"
SAMPLE 2:"Physical Education Activity ProgramHealth & Fitness Strength TrainingKINE 198-837Instructor: JANE DOE Office: PEAP 230Office Hours: By appointmentPhone: (000) 000-0000E-Mail: jdoe#xyz.edu A. Activity Instructor: Jane DoeOffice: PEAP 250Office Hours: By appointmentClass Time: Thursday 2:20 pmPhone: (000) 000-0000Email: jdoe#xyz1.edu Class Meeting Site: PEAP 117B. Activity Instructor: Jane Doe Phone:Office: PEAP 239Email: jdoe#xyz1.eduOffice Hours: Thursday 10:00 am – 12:00 pmClass Time: Thursday 2:20 pmClass Meeting Site: PEAP 118C. Activity Instructor: John doe Office: PEAP 250/Doe 213KOffice Hours: Tuesday 1:00-2:00 pmClass Time: Thursday 2:20 pmPhone:Email: johndoe#xyz.eduClass Meeting Site: PEAP 120Attire: Proper clothes and shoes designed specifically for strength training on activitydays.Required Materials:Bounds, L., Agnor, D., Darnell,G., & Brekken Shea, K. (2012). Health & Fitness: AGuide to a Healthy Lifestyle (5th edition). Dubuque, IA: Kendall/Hunt Publishing Co.ISBN 978-1-4652-0712-8Cissik, J. (2001). The Basics of Strength Training (3rd Edition). McGraw-Hill,Primus Custom PublishingCourse Description:Health and Fitness is intended for the student who is seeking knowledge and practicalapplication of wellness choices to their life. The course consists of two components,lecture and activity. Students will meet face-to-face one day per week for the activityportion of the class and work approximately the equivalent of one day per week onlinewith lecture materials. The lecture portion will cover current health issues includingmental and physical health, nutrition, human sexuality, communicable and noncommunicable diseases, use and abuse of drugs, and safety. 
The activity portion willconsist of 14 class days and cover basic knowledge and techniques of strength trainingand improving the individual’s fitness through the utilization of this knowledge.Course Rationale:Research indicates that daily health/fitness related behaviors enhance learning anddetermine the quality and longevity of our life."
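As a starting point for the regex route mentioned above, here is a rough Python sketch that picks out APA-style fragments of the form "(year). Title". It is only an illustration under stated assumptions: the function name find_citations is hypothetical, the test string is a shortened copy of the "Required Materials" passage in Sample 2, and the pattern will not catch free-form listings such as the book list in Sample 1.

```python
import re

# A four-digit year in parentheses, a period, then the title up to the
# next period or opening parenthesis.
CITATION = re.compile(r"\((?P<year>(?:19|20)\d{2})\)\.\s*(?P<title>[^.(]+)")


def find_citations(text):
    """Return (year, title) pairs for APA-like '(year). Title' fragments."""
    return [
        (match.group("year"), match.group("title").strip())
        for match in CITATION.finditer(text)
    ]


# Shortened copy of the 'Required Materials' passage in Sample 2.
sample = (
    "Required Materials:Bounds, L., Agnor, D., Darnell,G., & Brekken Shea, K. "
    "(2012). Health & Fitness: AGuide to a Healthy Lifestyle (5th edition). "
    "Dubuque, IA: Kendall/Hunt Publishing Co.ISBN 978-1-4652-0712-8"
    "Cissik, J. (2001). The Basics of Strength Training (3rd Edition). "
    "McGraw-Hill,Primus Custom Publishing"
)

citations = find_citations(sample)
for year, title in citations:
    print(year, "|", title)
```

Parenthesised text that is not a year, such as "(5th edition)" or the "(000)" phone prefix, is skipped because the pattern requires a year starting with 19 or 20.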
I have a dataframe:
free_text
"Lead Software Engineer Who We Are: CareerBuilder is the global leader in human capital solutions as we help people target and attract their most important asset - their people. From candidate sourcing solutions, to comprehensive workforce data, to software that streamlines your recruiting process, our focus is always about making your recruitment strategy simple, fast and effective. Are you an experienced software engineer looking to take the next step to leadership? Would you like to lead a team of agile software developers? If so, then we have an immediate need for a self-motivated software engineering lead to join the Candidate Data Processing team in our Norcross, Georgia office. The Candidate Data Processing team is responsible for processing and enriching millions of candidate profiles. We use the Amazon AWS ecosystem as well as our own in-house platform to enhance, normalize, and index candidate profiles from a variety of sources. Our projects require scalable solutions with continuous availability. CareerBuilder engineers participate in every phase of the software development lifecycle and are encouraged to have vision beyond the technical aspects of a project. This position requires knowledge in the theory and practical application of object-oriented design and programming. Prior leadership experience and experience with databases and cloud-computing technologies are desired. Your primary responsibilities as an Engineering Lead will be split between management and technical contributions. You will work with an agile project manager and a product owner to establish objectives and results, and you will lead a team of 3 to 5 software engineers to meet those objectives in a sustainable process. Some of the technologies your team will be using include: AWS (Lambda, SNS, S3, EC2, SQS, DynamoDB, etc.) 
Java or .net (Java, C#, VB.Net) Unit testing (Junit, MSTest, Moq) Relational databases (SQL) Web services (REST APIs, JSON, RestSharp) Git/github Linux (bash, cron) Job Requirements What we need from you: A passion for technology and bringing your visions to reality through code and leveraging state of the art technologies As a lead, you will take ownership of issues and challenges and will also be a proactive and effective communicator; this role requires successful verbal and written communication to many different audiences inside and outside of Careerbuilder Demonstrated ability to earn your teammates' trust and respect through clear, honest, and helpful communication We prefer you to have proven leadership experience, but also be a hands on, passionate coder BS in Computer Science or related field (preferred but not required) What you will receive: When you're focused on the goal, not the path - you can be more flexible, and that translates into more productive and satisfied employees. From flexible hours to volunteering during work hours to diverse education opportunities, CareerBuilder.com is committed to helping employees strike a balance. Training that positions you to continuously grow with ongoing learning and development courses; we never stop investing in our people. Summer Hours! Enjoy 1/2 day paid Fridays during Summer Hours Quarterly 24 hour Hackathons and bi-weekly personal development time to learn new skills Paid volunteer time and coordinated opportunities to give back to the community Bagel Fridays! Casual Dress Code and laid back environment; don't worry about buying new suits and dry cleaning bills! Comprehensive Medical, Dental & Vision Programs Education Reimbursement Program allowing up to $5k per year towards completion of a Bachelor's and non-MBA graduate degree, and up to $10K per year towards completion of an MBA! No strings attached! $400 Annual Reimbursement for Wellness Activities, including your gym membership! 
401(k) Program with Strong Employer Match and 2 year vesting schedule! Five Star Company Paid Trips for top performers, pack your bags and get ready to experience luxury! CareerBuilder, LLC is proud to be an Equal Opportunity Employer. Applicants are considered for all positions without regard to race, color, religion, sex, national origin, age, disability, sexual orientation, ancestry, marital or veteran status."
"Quality Engineer TSS is currently seeking Quality Engineer for Industrial Manufacturer in the London, KY area. Qualified candidates must have experience in Quality Engineering or related degree. Job Requirements Directs sampling inspection, and testing of produced/received parts, components and materials to determine conformance to standards. Host customers for audits, react to customer complaints, follow through on all sorting and rework of suspect parts. Control of the product sorting/hold areas of the facility. Responsible for directing, instructing and organizing the work of parts sort area. Must follow-up with efficiency, effectiveness and safety of those assigned to work the area. Provides training and completes documentation of all quality training provided to Company employees and forwarding that paperwork to the appropriate individuals (Supervisors, Engineering, Human Resources, etc.). Develop PPAP documentation for specific products; including Quality Control Plans, Flowcharts, FMEA’s, Inspection Reports, measurement/calculations coordination and PSW. Acts as Internal Auditor Coordinator and oversees the maintenance of all TS 16949 documentation. Applies statistical process control (SPC) methods for analyzing data to evaluate the current process and process changes. Works with supervisors and other responsible persons on determining root cause and developing corrective actions for all internal quality concerns. Participate in APQP for specific programs. Communicate with the customer as necessary to ensure all issues around assigned programs are resolved in a timely manner. Respond to customer corrective Action Requests. Develop gauging requirements for assigned programs. Monitor process capability to ensure required standards are maintained. Participate in Continuous Improvement programs. Perform workstation audits on assigned programs. Perform vendor quality audits as required. 
Prepares and presents technical and program information to team members and management. Accepts responsibility for subordinates?activities; Solicits and applies customer feedback (internal and external); Fosters quality focus in others. Provides computerized status report describing progress and concerns related to inspection activities, nonconforming items, and/or other items related to the quality of the process, material, or product. Reviews quality trends, tracks the root cause of problems, and coordinates correction actions. Provides input and recommendations to management on process of procedural system improvements, such as configuration management and operations functions. Work with technicians to ensure products are measured correctly and all data is compiled for on-time PPAP submissions. Will document and review supplier quality issues to the quality files daily, and communicate any needed Corrective Actions or plans from the suppliers. Formulates contingency plans, reviews control plans and FMEAs and makes necessary updates to the database as needed. Responsibilities include training; assigning and directing work of temporary re-work employees. All other duties as assigned. Training: TS 16949 Documentation: APQP, PPAP, FMEA, MSA Internal Auditing Education Requirements: College degree or equivalent experience as determined by the Quality Manager. Skills: To perform this job successfully, an individual must be able to perform each essential job functions satisfactory. The duties and responsibilities listed above are representative of the knowledge, skill and/or ability required for the position. Excellent verbal and written skills: Proficient in computer software including Word, Excel, Access: Strong leadership skills: Good problem solving skills; Communicate well with others at all levels. Experience: To perform this position successfully, an individual should have a minimum of three (3) years in related field. "
I tried to run this code:
library(tidytext)
library(stringr)
reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_df <- df %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text,
                                "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https",
                                "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
But I receive this error:
Error in stri_detect_regex(string, pattern, opts_regex = opts(pattern)) :
argument `str` should be a character vector (or an object coercible to)
Is there a problem with the input data that causes this error? What can I do to fix it?
You forgot to load dplyr (library(dplyr)). This causes R to use stats::filter() rather than dplyr::filter(). The former function has a different signature and does not expose free_text to the inner str_detect().
I was looking to use Stripe to take online payments for an online lottery website; however, this kind of application is listed as a prohibited business.
Prohibited Businesses: Gambling
Lotteries; bidding fee auctions; sports forecasting or odds making; fantasy sports leagues with cash prizes; internet gaming; contests; sweepstakes; games of chance
Alternative options?
I am looking for an alternative to Stripe that would take online payments for my application.
It is a startup business, so I would like the payment provider to handle the merchant bank account side, like Stripe or PayPal do.
The project is being developed in ASP.NET Web Forms with C#.
Any advice would be greatly appreciated.
Most countries regulate gambling in some form.
A few examples:
Some countries, such as France, have a company that handles this under the authority of the government.
In the US, gambling regulation differs by state, and some states don't allow lotteries at all.
In Ireland, recent laws allow online gambling provided you first acquire a license issued by the state. Operating without this license can result in a fine of up to €300,000.
There is a good chance that your lottery application will fall under similar regulation, in which case you have to contact the relevant authority in your country and ask how you can operate a gambling application lawfully, if it is permitted at all (keeping in mind that this could be a long and tedious process).
Bottom line for your question:
Stripe and other online payment systems do not allow these types of payments because of this regulation.
Even once you clear the regulatory barrier, you would have to apply many technical restrictions to verify people's residence and avoid legal issues.
UPDATE:
One option, as mentioned in the comments, would be to use Bitcoin (which can be used with ASP.NET) as an alternative currency to circumvent legal issues, but that doesn't mean it is unregulated or will stay that way in the near future (legally, it currently falls into a lacuna).
I am trying to scrape and parse the following RSS feed: http://www.nestle.com/_handlers/rss.ashx?q=068f9d6282034061936dbe150c72d197. I have no problem extracting the basic items that I need (e.g., title, description, pubDate) using the following code:
library(RCurl)
library(XML)
xml.url <- "http://www.nestle.com/_handlers/rss.ashx?q=068f9d6282034061936dbe150c72d197"
script <- getURL(xml.url)
doc <- xmlParse(script)
titles <- xpathSApply(doc,'//item/title',xmlValue)
descriptions <- xpathSApply(doc,'//item/description',xmlValue)
pubdates <- xpathSApply(doc,'//item/pubDate',xmlValue)
My problem is that the output for item "description" includes not only the actual text but also a lot of style formatting expressions. For example, the first element is:
descriptions[1]
[1] "<p><iframe height=\"322\" src=\"https://www.youtube-nocookie.com/embed/fhESDXnlMa0?rel=0\" frameBorder=\"0\" width=\"572\"></iframe><br />\n<br />\n<p><em>Nescafé</em> is partnering with Facebook to launch an immersive video, pioneering new technology just released for the platform.</p>\n<p>\nThe <em>Nescafé</em> <a class=\"externalLink\" title=\"Opens in a new window: Nescafé on Facebook\" href=\"https://www.facebook.com/Nescafe/videos/vb.203900255471/10156233581755472/?type=2&theater\" target=\"_blank\">‘Good Morning World’ video</a> stars people in kitchens across the world, performing the hit song ‘Don’t Worry’ using spoons, cups, forks and a jar of coffee. Uniquely, viewers can rotate their smartphones through 360˚ to explore the video, the first time this has been possible on Facebook.</p>\n<p>\n“We know young coffee lovers pick up their phone at the start of every day looking to be entertained by real experiences. The 360˚ video allows us to be engaging in an innovative way,” said Carsten Fredholm, Senior Vice President of Nestlé’s Beverage Strategic Business Unit.\n</p>\n<p><em>Nescafé</em> recently teamed up with Google to offer the first virtual reality coffee experience through the <em>Nescafé 360˚</em> app. It also became the first global brand to move its website onto Tumblr, to strengthen connections with younger fans by allowing them to create and share content.</p>\n<p>The Nestlé brand is one of only six globally to partner Facebook for the launch of this technology.</p></p>"
I can think of a regex approach to replace the unwanted character strings. However, is there a way to access the plain text elements of item "description" directly through xpath?
Any help with this issue is very much appreciated. Thank you.
You can do:
descriptions <- sapply(descriptions, function(x) {
  xmlValue(xmlRoot(htmlParse(x)))
}, USE.NAMES=FALSE)
which gives (via cat(stringr::str_wrap(descriptions[[1]], 70))):
In a move that will provide young Europeans increased access to
jobs and training opportunities, Nestlé and the Alliance for YOUth
have joined the European Pact for Youth as founding members. Seven
million people in Europe under the age of 25 are still inactive -
neither in employment, education or training. The European Pact for
Youth, created by European CSR business network CSR Europe and the
European Commission, aims to work together with businesses, youth
organisations, education providers and other stakeholders to reduce
skills gaps and increase youth employability. As part of the Pact, the
Alliance for YOUth will focus on setting up âdual learningâ schemes
across Europe, combining formal education with apprenticeships and on-
the-job training to help match skills with jobs on the market. The
Alliance for YOUth is a group of almost 200 companies mobilised by
Nestlé to help young people in Europe find work. It has pledged to
create 100,000 employability opportunities by 2017 and has already met
half of this target in its first year. Luis Cantarell, Executive Vice
President for Nestlé and co-initiator of the European Pact for Youth,
said: âPromoting a cultural shift to dual learning schemes based on
business-education collaboration is at the heart of Nestléâs youth
employment initiative since its start in 2013. The European Pact for
Youth will help to build a skilled workforce and will tackle youth
unemployment.â Learn more about the European Pact for Youth and read
their press release.
There are \n characters at various points in the resultant text (in almost all the descriptions) but you can gsub those away.
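For readers doing the same thing in Python, the idea carries over: parse each description as an HTML fragment and keep only the text nodes. This is a minimal sketch using only the standard library's html.parser. The class and function names are hypothetical, and the sample string is a shortened fragment modelled on the description shown in the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    # split()/join() also removes the stray newlines between paragraphs.
    return " ".join("".join(parser.parts).split())


# Shortened fragment modelled on the description shown in the question.
sample = (
    "<p><em>Nescafé</em> is partnering with Facebook.</p>\n<p>\n"
    'The <em>Nescafé</em> <a href="https://www.facebook.com/Nescafe/">video</a>'
    " stars people.</p>"
)

plain = strip_tags(sample)
print(plain)
```

Collapsing whitespace in strip_tags plays the same role as the gsub step suggested above for the leftover \n characters.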