Fuzzy Matching a paragraph within a large text

I need to do a pretty complex matching of phrases.
I have large bodies of text in files which exceed 1000 words each.
The phrases I am searching for (searchphrase) are like this:
Investment does not mean:
i. Claims to money that arise solely from:
1. Commercial contracts for the sale of goods or
services by a national or an enterprise of a party
to an enterprise in the territory of the other party,
or
2. The extension of credit in connection with a
commercial transaction, such as trade financing
other than loans or claims to money previously
covered.
I want to know whether the phrase occurs in each of the files I have. However, the files will not contain exact replicas of the phrase. Instead, a file (textfile) will be a large document with a paragraph like:
But investment does not mean claims to money derived solely from
commercial transactions designed exclusively for the sale of goods or
services by a national or legal person in the territory of one
Contracting Party to a national or legal person in the territory of the
other Contracting Party, credits to finance commercial transactions such
as trade financing, and other credits with a duration of less than three
years, as well as credits granted to the State or to a State enterprise.
As you can see, searchphrase is pretty similar in actual meaning to this paragraph from textfile. There is also considerable overlap in the keywords. Hence, I should get a match.
What sort of algorithm should I try and use to code this? Are pre-coded modules available anywhere that do this job?
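One family of approaches (a suggestion, not something from the original post) is to treat this as near-duplicate detection: split each file into paragraphs, turn the search phrase and each paragraph into bag-of-words vectors, and flag paragraphs whose cosine similarity to the phrase exceeds a threshold. For pre-built modules, R packages such as tm and text2vec provide TF-IDF weighting and similarity functions, and stringdist covers edit-distance-style fuzzy matching. Below is a minimal base-R sketch, assuming searchphrase and textfile are single character strings and that plain word overlap is an acceptable proxy for meaning:
# Minimal sketch: cosine similarity between word-count vectors.
# Assumes paragraphs are separated by blank lines; the 0.5 threshold is a
# guess and should be calibrated on documents with known matches.
tokenize <- function(x) {
  w <- tolower(unlist(strsplit(x, "[^[:alpha:]]+")))
  w[nchar(w) > 2]                     # crudely drop empty and very short tokens
}
cosine_sim <- function(a, b) {
  vocab <- union(names(a), names(b))
  va <- as.numeric(a[vocab]); va[is.na(va)] <- 0
  vb <- as.numeric(b[vocab]); vb[is.na(vb)] <- 0
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  if (denom == 0) 0 else sum(va * vb) / denom
}
phrase_tf <- table(tokenize(searchphrase))
paras <- unlist(strsplit(textfile, "\n[[:space:]]*\n"))   # split file into paragraphs
scores <- vapply(paras, function(p) cosine_sim(phrase_tf, table(tokenize(p))), numeric(1))
any(scores > 0.5)                     # TRUE if some paragraph is similar enough
TF-IDF weighting (down-weighting words that appear everywhere) and a proper stop-word list usually improve this considerably over raw counts.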

Related

Character vector using tidyr

I have a dataframe:
free_text
"Lead Software Engineer Who We Are: CareerBuilder is the global leader in human capital solutions as we help people target and attract their most important asset - their people. From candidate sourcing solutions, to comprehensive workforce data, to software that streamlines your recruiting process, our focus is always about making your recruitment strategy simple, fast and effective. Are you an experienced software engineer looking to take the next step to leadership? Would you like to lead a team of agile software developers? If so, then we have an immediate need for a self-motivated software engineering lead to join the Candidate Data Processing team in our Norcross, Georgia office. The Candidate Data Processing team is responsible for processing and enriching millions of candidate profiles. We use the Amazon AWS ecosystem as well as our own in-house platform to enhance, normalize, and index candidate profiles from a variety of sources. Our projects require scalable solutions with continuous availability. CareerBuilder engineers participate in every phase of the software development lifecycle and are encouraged to have vision beyond the technical aspects of a project. This position requires knowledge in the theory and practical application of object-oriented design and programming. Prior leadership experience and experience with databases and cloud-computing technologies are desired. Your primary responsibilities as an Engineering Lead will be split between management and technical contributions. You will work with an agile project manager and a product owner to establish objectives and results, and you will lead a team of 3 to 5 software engineers to meet those objectives in a sustainable process. Some of the technologies your team will be using include: AWS (Lambda, SNS, S3, EC2, SQS, DynamoDB, etc.) Java or .net (Java, C#, VB.Net) Unit testing (Junit, MSTest, Moq) Relational databases (SQL) Web services (REST APIs, JSON, RestSharp) Git/github Linux (bash, cron) Job Requirements What we need from you: A passion for technology and bringing your visions to reality through code and leveraging state of the art technologies As a lead, you will take ownership of issues and challenges and will also be a proactive and effective communicator; this role requires successful verbal and written communication to many different audiences inside and outside of Careerbuilder Demonstrated ability to earn your teammates' trust and respect through clear, honest, and helpful communication We prefer you to have proven leadership experience, but also be a hands on, passionate coder BS in Computer Science or related field (preferred but not required) What you will receive: When you're focused on the goal, not the path - you can be more flexible, and that translates into more productive and satisfied employees. From flexible hours to volunteering during work hours to diverse education opportunities, CareerBuilder.com is committed to helping employees strike a balance. Training that positions you to continuously grow with ongoing learning and development courses; we never stop investing in our people. Summer Hours! Enjoy 1/2 day paid Fridays during Summer Hours Quarterly 24 hour Hackathons and bi-weekly personal development time to learn new skills Paid volunteer time and coordinated opportunities to give back to the community Bagel Fridays! Casual Dress Code and laid back environment; don't worry about buying new suits and dry cleaning bills! 
Comprehensive Medical, Dental & Vision Programs Education Reimbursement Program allowing up to $5k per year towards completion of a Bachelor's and non-MBA graduate degree, and up to $10K per year towards completion of an MBA! No strings attached! $400 Annual Reimbursement for Wellness Activities, including your gym membership! 401(k) Program with Strong Employer Match and 2 year vesting schedule! Five Star Company Paid Trips for top performers, pack your bags and get ready to experience luxury! CareerBuilder, LLC is proud to be an Equal Opportunity Employer. Applicants are considered for all positions without regard to race, color, religion, sex, national origin, age, disability, sexual orientation, ancestry, marital or veteran status."
"Quality Engineer TSS is currently seeking Quality Engineer for Industrial Manufacturer in the London, KY area. Qualified candidates must have experience in Quality Engineering or related degree. Job Requirements Directs sampling inspection, and testing of produced/received parts, components and materials to determine conformance to standards. Host customers for audits, react to customer complaints, follow through on all sorting and rework of suspect parts. Control of the product sorting/hold areas of the facility. Responsible for directing, instructing and organizing the work of parts sort area. Must follow-up with efficiency, effectiveness and safety of those assigned to work the area. Provides training and completes documentation of all quality training provided to Company employees and forwarding that paperwork to the appropriate individuals (Supervisors, Engineering, Human Resources, etc.). Develop PPAP documentation for specific products; including Quality Control Plans, Flowcharts, FMEA’s, Inspection Reports, measurement/calculations coordination and PSW. Acts as Internal Auditor Coordinator and oversees the maintenance of all TS 16949 documentation. Applies statistical process control (SPC) methods for analyzing data to evaluate the current process and process changes. Works with supervisors and other responsible persons on determining root cause and developing corrective actions for all internal quality concerns. Participate in APQP for specific programs. Communicate with the customer as necessary to ensure all issues around assigned programs are resolved in a timely manner. Respond to customer corrective Action Requests. Develop gauging requirements for assigned programs. Monitor process capability to ensure required standards are maintained. Participate in Continuous Improvement programs. Perform workstation audits on assigned programs. Perform vendor quality audits as required. Prepares and presents technical and program information to team members and management. Accepts responsibility for subordinates?activities; Solicits and applies customer feedback (internal and external); Fosters quality focus in others. Provides computerized status report describing progress and concerns related to inspection activities, nonconforming items, and/or other items related to the quality of the process, material, or product. Reviews quality trends, tracks the root cause of problems, and coordinates correction actions. Provides input and recommendations to management on process of procedural system improvements, such as configuration management and operations functions. Work with technicians to ensure products are measured correctly and all data is compiled for on-time PPAP submissions. Will document and review supplier quality issues to the quality files daily, and communicate any needed Corrective Actions or plans from the suppliers. Formulates contingency plans, reviews control plans and FMEAs and makes necessary updates to the database as needed. Responsibilities include training; assigning and directing work of temporary re-work employees. All other duties as assigned. Training: TS 16949 Documentation: APQP, PPAP, FMEA, MSA Internal Auditing Education Requirements: College degree or equivalent experience as determined by the Quality Manager. Skills: To perform this job successfully, an individual must be able to perform each essential job functions satisfactory. 
The duties and responsibilities listed above are representative of the knowledge, skill and/or ability required for the position. Excellent verbal and written skills: Proficient in computer software including Word, Excel, Access: Strong leadership skills: Good problem solving skills; Communicate well with others at all levels. Experience: To perform this position successfully, an individual should have a minimum of three (3) years in related field. "
And I try to test this code:
library(tidytext)
library(stringr)
reg <- "([^A-Za-z_\\d##']|'(?![A-Za-z_\\d##]))"
tidy_df <- df %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text,
                                "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&|<|>|RT|https",
                                "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
But I receive this error:
Error in stri_detect_regex(string, pattern, opts_regex = opts(pattern)) :
argument `str` should be a character vector (or an object coercible to)
Is this error caused by a problem with the input data? What can I do to fix it?
You forgot to load dplyr (library(dplyr)). Without it, R falls back to stats::filter() rather than dplyr::filter(). The former has a different signature and does not expose the free_text column to the inner str_detect() call.
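For completeness, a sketch of the corrected pipeline (assuming the column is named free_text, as in the data frame above):
library(dplyr)     # provides filter(), mutate() and the %>% pipe
library(tidytext)
library(stringr)

reg <- "([^A-Za-z_\\d##']|'(?![A-Za-z_\\d##]))"

tidy_df <- df %>%
  filter(!str_detect(free_text, "^RT")) %>%   # dplyr::filter(), now that dplyr is loaded
  mutate(text = str_replace_all(free_text,
                                "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&|<|>|RT|https",
                                "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))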

Online Payment System - Lottery Application (Stripe Prohibited Businesses) alternative?

Basically, I was looking to use Stripe to take online payments for an online lottery website; however, that kind of application is marked as a prohibited business.
Prohibited Businesses: Gambling
Lotteries; bidding fee auctions; sports forecasting or odds making; fantasy sports leagues with cash prizes; internet gaming; contests; sweepstakes; games of chance
Alternative options?
I was looking for an option other than Stripe that would take online payments for my application.
It is a startup business, so I would like the payment provider to handle the merchant bank account side, as Stripe/PayPal do.
The project is being developed in ASP.NET, Web Forms, C#.
Any advice would be greatly appreciated.
Most countries regulate gambling in some form.
A few examples:
some countries, like France, have a dedicated company handling gambling under the authority of the government.
in the US, gambling regulation differs by state, and some states do not allow lotteries at all.
in Ireland, recent laws allow online gambling, provided you first acquire a license issued by the state. Operating without this license can incur a fine of up to €300,000.
There is a good chance that your lottery application will fall under the same regulations, in which case you will have to contact the relevant authority in your country and ask how you can create a gambling application within the law, if it is permitted at all (keeping in mind that this can be a pretty tedious and long process).
The bottom line for your question:
Stripe and other online payment systems do not allow these types of payments because of this regulation.
Even if you clear the regulatory barrier, a number of technical restrictions would still have to be applied, such as verifying users' residence, to avoid legal issues.
UPDATE:
One option, as mentioned in the comments, would be to use Bitcoin (which can be integrated with ASP.NET) as an alternative currency to sidestep these issues; but the fact that Bitcoin is largely unregulated today (a legal lacuna) does not mean it will stay that way.

Bypass Style Formatting when Parsing RSS Feed in R

I am trying to scrape and parse the following RSS feed: http://www.nestle.com/_handlers/rss.ashx?q=068f9d6282034061936dbe150c72d197. I have no problem extracting the basic items that I need (e.g., title, description, pubDate) using the following code:
library(RCurl)
library(XML)
xml.url <- "http://www.nestle.com/_handlers/rss.ashx?q=068f9d6282034061936dbe150c72d197"
script <- getURL(xml.url)
doc <- xmlParse(script)
titles <- xpathSApply(doc, '//item/title', xmlValue)
descriptions <- xpathSApply(doc, '//item/description', xmlValue)
pubdates <- xpathSApply(doc, '//item/pubDate', xmlValue)
My problem is that the output for item "description" includes not only the actual text but also a lot of style formatting expressions. For example, the first element is:
descriptions[1]
[1] "<p><iframe height=\"322\" src=\"https://www.youtube-nocookie.com/embed/fhESDXnlMa0?rel=0\" frameBorder=\"0\" width=\"572\"></iframe><br />\n<br />\n<p><em>Nescafé</em> is partnering with Facebook to launch an immersive video, pioneering new technology just released for the platform.</p>\n<p>\nThe <em>Nescafé</em> <a class=\"externalLink\" title=\"Opens in a new window: Nescafé on Facebook\" href=\"https://www.facebook.com/Nescafe/videos/vb.203900255471/10156233581755472/?type=2&theater\" target=\"_blank\">‘Good Morning World’ video</a> stars people in kitchens across the world, performing the hit song ‘Don’t Worry’ using spoons, cups, forks and a jar of coffee. Uniquely, viewers can rotate their smartphones through 360˚ to explore the video, the first time this has been possible on Facebook.</p>\n<p>\n“We know young coffee lovers pick up their phone at the start of every day looking to be entertained by real experiences. The 360˚ video allows us to be engaging in an innovative way,” said Carsten Fredholm, Senior Vice President of Nestlé’s Beverage Strategic Business Unit.\n</p>\n<p><em>Nescafé</em> recently teamed up with Google to offer the first virtual reality coffee experience through the <em>Nescafé 360˚</em> app. It also became the first global brand to move its website onto Tumblr, to strengthen connections with younger fans by allowing them to create and share content.</p>\n<p>The Nestlé brand is one of only six globally to partner Facebook for the launch of this technology.</p></p>"
I can think of a regex approach to replace the unwanted character strings. However, is there a way to access the plain text elements of item "description" directly through xpath?
Any help with this issue, is very much appreciated. Thank you.
You can do:
descriptions <- sapply(descriptions, function(x) {
  xmlValue(xmlRoot(htmlParse(x)))
}, USE.NAMES = FALSE)
which gives (via cat(stringr::str_wrap(descriptions[[1]], 70))):
In a move that will provide young Europeans increased access to
jobs and training opportunities, Nestlé and the Alliance for YOUth
have joined the European Pact for Youth as founding members. Seven
million people in Europe under the age of 25 are still inactive -
neither in employment, education or training. The European Pact for
Youth, created by European CSR business network CSR Europe and the
European Commission, aims to work together with businesses, youth
organisations, education providers and other stakeholders to reduce
skills gaps and increase youth employability. As part of the Pact, the
Alliance for YOUth will focus on setting up “dual learning” schemes
across Europe, combining formal education with apprenticeships and on-
the-job training to help match skills with jobs on the market. The
Alliance for YOUth is a group of almost 200 companies mobilised by
Nestlé to help young people in Europe find work. It has pledged to
create 100,000 employability opportunities by 2017 and has already met
half of this target in its first year. Luis Cantarell, Executive Vice
President for Nestlé and co-initiator of the European Pact for Youth,
said: “Promoting a cultural shift to dual learning schemes based on
business-education collaboration is at the heart of Nestlé’s youth
employment initiative since its start in 2013. The European Pact for
Youth will help to build a skilled workforce and will tackle youth
unemployment.” Learn more about the European Pact for Youth and read
their press release.
There are \n characters at various points in the resultant text (in almost all the descriptions), but you can gsub those away.
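For example (one way to do it, not from the original answer):
descriptions <- gsub("\n", " ", descriptions, fixed = TRUE)  # collapse stray newlines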

Parsing RTF files into R?

I couldn't find much support for this in R. I'm trying to read a number of RTF files into R to construct a data frame, but I'm struggling to find a good way to parse the RTF files while ignoring their structure/formatting. There are really only two lines of text I want to pull from each file, but they're nested within the structure of the file.
I've pasted a sample RTF file below. The two strings I'd like to capture are:
"Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"
"The technology level [...] and managerial implications." (the full paragraph)
Any thoughts on how to efficiently parse this? I think regular expressions might help me, but I'm struggling to form the right expression to get the job done.
{\rtf1\ansi\ansicpg1252\cocoartf1265
{\fonttbl\f0\fswiss\fcharset0 ArialMT;\f1\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red109\green109\blue109;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\deftab720
\itap1\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil
\clvertalt \clshdrawnil \clwWidth15680\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\itap2\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil
\clmgf \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx4320
\clmrg \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\pard\intbl\itap2\pardeftab720
\f0\b\fs26 \cf0 Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products\nestcell
\pard\intbl\itap2\nestcell \lastrow\nestrow
\pard\intbl\itap1\pardeftab720
\f1\b0\fs24 \cf0 \
\pard\intbl\itap1\pardeftab720
\f0\fs26 \cf0 The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers\'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications.\cell \lastrow\row
\pard\pardeftab720
\f1\fs24 \cf0 \
}
1) A simple way, if you are on Windows, is to open the file in WordPad or Word and then save it as a plain text document.
2) Alternately, to parse it directly in R: read in the RTF file and find the lines matching the pattern pat, producing g. Then replace any \\' strings with single quotes, producing noq. Finally, remove pat and any trailing junk. This works on the sample, but you might need to revise the patterns if there are embedded \\ strings other than the \\' that we already handle:
Lines <- readLines("myfile.rtf")      # raw RTF source, one element per line
pat <- "^\\\\f0.*\\\\cf0 "            # the text lines start with \f0 ... \cf0
g <- grep(pat, Lines, value = TRUE)   # keep only those lines
noq <- gsub("\\\\'", "'", g)          # drop the backslash from \' escape sequences
sub("\\\\.*", "", sub(pat, "", noq))  # strip the prefix and trailing RTF commands
For the indicated file this is the output:
[1] "Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"
[2] "The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications."
Revised several times. Added Wordpad/Word solution.
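As an aside (not part of the original answer): if installing a package is acceptable, the CRAN package striprtf aims to do exactly this; its read_rtf() should return the plain text of an RTF file with the formatting stripped, which you can then grep for the two lines you need. A sketch, assuming the package behaves as documented:
library(striprtf)                     # CRAN package for stripping RTF markup
txt <- read_rtf("myfile.rtf")         # character vector of plain-text lines
txt <- txt[nchar(trimws(txt)) > 0]    # drop empty lines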

Does anyone know of a service/db I can use for businesses and geocode?

I was wondering how and where companies like Foursquare/Gowalla find their lists of locations/businesses and keep them up to date.
Is it a web service? Do they buy a directory and enter it into a database?
This is from a comment I found at http://www.quora.com/Where-or-how-does-a-company-like-Foursquare-get-a-directory-of-all-locations-and-their-addresses
Companies usually get place data from one of the following:
Data licenses: Companies like Localeze, InfoUSA, Amacai etc. license location data. Big players like TeleAtlas and Navteq serve as global aggregators of this data. There are also lots of small niche players that license e.g. restaurant data only, or ATM data only, on a per-country basis.
Crowd sourcing: Some companies crowd-source their data.
Open data sets: There are some data sets with a Creative Commons or other license from which location-related data can be extracted, e.g. GeoCommons and Wikipedia.
APIs: A number of companies provide APIs through which you can access data on the fly. These include GeoAPI.com, Google, Yelp and others.
In general, this data is fragmented both by type (e.g. POI vs. neighborhood or geocode) and by place (US vs. UK vs. South Africa vs. wherever).
Google has a geocoding service that is freely available for personal use.
For business use it costs a few dollars, but it is still pretty reasonable.
And the API is pretty straightforward:
http://code.google.com/apis/maps/documentation/javascript/v2/services.html
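For illustration, a minimal sketch of querying Google's JSON geocoding endpoint from R (this uses the newer v3-style endpoint rather than the v2 docs linked above; the API key is a placeholder, and current pricing/quota terms are Google's to define, not stated here):
library(jsonlite)   # fromJSON() can read JSON straight from a URL

address <- "1600 Amphitheatre Parkway, Mountain View, CA"   # example address
url <- paste0("https://maps.googleapis.com/maps/api/geocode/json",
              "?address=", URLencode(address, reserved = TRUE),
              "&key=YOUR_API_KEY")                          # placeholder key
res <- fromJSON(url)
res$results$geometry$location   # data frame of lat/lng for each match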
