How to read a list of values into a data table in a sandbox? - r

I have a list of data. It's a single column; each row is a comment from a post asking for book recommendations. Here's an example containing the first two entries:
"My recommendations from books I read this year:<p>Bad Blood : Man, this book really does read like a Hollywood movie screenplay. The rise and fall of Theranos, documented through interviews with hundreds of ex-employees by the very author who came up with the first expose of Theranos. Truly shows the flaws in the "fake it before you make it" mindset and how we glorify "geniuses".<p>Shoe Dog : Biography of the founder of Nike. Really liked how it's not just a book glorifying the story of Nike, but tells the tale of how much effort, balance and even pure luck went into making the company the household name it is today.<p>Master Algorithm : It's a book about the different fields of Machine learning (from Bayesian to Genetic evolution algos) and talks about the pros and cons of each and how these can play together to create a "master algorithm" for learning. It's a good primer for people entering the field and while it's not a DIY, it shows the scope of the problem of learning as a whole.<p>Three Body Problem: Finally, after years of people telling me to read this (on HN and off), I read the trilogy (Remembrance of Earth's Past), and I must say, the series does live up to the hype. Not only is it fast paced and deeply philosophical, but it's presented in a format very accessible to casual readers as well (unlike many hard sci-fi books which seem to revel in complexity). If I had to describe this series in a single line, it's "What would happen if China was the country that made first contact with an alien race?"","A selection:<p>Sapiens (Yuval Noah Harari, 2014 [English]) - A bit late to the party on this one. Mostly enjoyed it, especially the early ancient history stuff, but I felt it got a bit contrived in the middle - like the author was forcing it. Overall a good read though.<p>How to Invent Everything (Ryan North, 2018) - First book I've pre-ordered in a long time. A look at the history of civilization and technology through a comedic lens. Pretty funny and enjoyable.<p>The Rise of Theodore Roosevelt (Edmund Morris, 1979) - Randomly happened across this book while browsing a used bookstore for some stuff to read on a summer vacation. Loved it. It's big, but reads pretty quick for a biography. I've been a fan of TR since I first really learned about him in High School and I would recommend this for anyone interested in TR/The West/Americana.<p>Jaws (Peter Benchley, 1974) - Quite a bit darker than the movie.<p>Sharp Objects (Gillian Flynn, 2006) - I enjoyed Gone Girl (book and film) so I wanted to read this before the HBO series. To be honest...not my cup of tea. It was <i>okay</i>.<p>The Art of Racing in the Rain (Garth Stein, 2008) - Made me cry on an airplane. Thankfully my coworkers were on a different flight."
(Note that the comments are separated by ",".)
I'm trying to load this list into a data table in an R sandbox (rapporter.net), but because of browser security I can't load a local file (with fread or read.table).
How can I read raw data into a data table in R?
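One approach, since browser security rules out files: paste the raw data into the script itself as a string and split it on the "," delimiter between entries. A minimal sketch (the sample string here is a placeholder); splitting on quote-comma-quote rather than a bare comma avoids breaking on the commas inside the comments themselves:
library(data.table)
# placeholder for the pasted comments; the real string would hold all entries
raw <- '"first comment, with commas","second comment"'
# strip the outer quotes, then split on the "," between entries
comments <- strsplit(gsub('^"|"$', '', raw), '","', fixed = TRUE)[[1]]
dt <- data.table(comment = comments)
If the data were well-formed CSV, recent versions of data.table also accept a string directly via fread(text = raw), but the unescaped quotes inside these comments make the manual split safer.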

Related

Using OptaPlanner to create school time tables with some tricky constraints

I'm going to use OptaPlanner to lay out time tables for a school.
We're laying out the time tables for a full semester and every week could, if necessary, be slightly different.
There are some tricky constraints to take into account:
1. Weekly schedules
The lectures in one subject should be spread out somewhat evenly over the semester.
We can't for example put 20 math lectures the first week and "be done" with math for this semester.
In fact, it's nice to have some weekly predictability:
"Science year 2 have biology on Tuesday mornings"
However, this constraint must not be carved in stone. Some weeks have to include work experience sessions, PE excursions, etc., in which case they must deviate from other weeks.
Problem
If I create a constraint that, say, gives -1soft for not scheduling a subject at the same time as the previous week, then OptaPlanner will waste a lot of time before it "accidentally" finds a good placement for a lecture. Even if it manages to converge so that each subject is scheduled at the same time every week, it will never manage to move the entire series of lectures by moving them one by one. (That local optimum will never be escaped.)
2. Cross student group subjects
There's a large correlation between student groups and courses; for example, all students in Science year 2 mostly take the same courses: Chemistry for Science year 2, Biology for Science year 2, ...
The exception is language courses.
Each student can choose to study French, German or Spanish. So Spanish for year 2 is studied by a cross-section of Science year 2 students, Social Studies year 2 students, etc.
From the experience of previous (manual) scheduling, the optimal solution is almost guaranteed to schedule all language classes in the same time slots. (If French is scheduled at 9 on Thursdays, then German and Spanish can be scheduled "for free" at 9 on Thursdays.)
Problem
There are many time slots in one semester, and the chance that OptaPlanner will discover a solution where all language lectures are scheduled at the same time by randomly moving individual lectures is small.
Also, similarly to problem 1: if OptaPlanner does manage to schedule French, German and Spanish at the same time, these "blocks" will never be moved elsewhere, since they are individual lectures, and the chance that all of them will "randomly" move to the same new slot is tiny, even with a large Tabu history length and so on.
My thoughts so far
As for problem 1 ("Weekly predictability") I'm thinking of doing the following:
In the construction phase for the full-semester schedule I create a reduced version of the problem that schedules a reduced set of lectures into a single "template week". Let's call it a "single-week-pre-scheduling". This template week is then repeated in the construction of the initial solution for the full semester, which is the "real" planning entity.
The local search steps will then only focus on inserting PE excursions etc, and adjusting the schedule for the affected weeks.
As for problem 2, I'm thinking that the solution to problem 1 might solve this too. In a one-week schedule, it seems reasonable to assume that OptaPlanner will realize that language classes should be scheduled at the same time.
Regarding the local optimum established by the single-week-pre-scheduling ("Biology is scheduled on Tuesday mornings"), I imagine I could create a custom move operation that "bundles" these lectures into a single move. I have no idea how simple that is; I would really like to keep the code as simple as possible.
Questions
Are my thoughts reasonable? Is there a cleverer way to approach these problems? If I have to create custom moves anyway, perhaps I don't need to construct a template week?
Is there a way to assign hints or weights to moves? If so, I could perhaps generate moves with slightly larger weights that adjust the schedule toward predictable weeks and languages scheduled in the same time slots.
A question well asked!
With regard to your first problem, I suggest you take a look at OptaWeb Employee Rostering and the concept of rotations. A rotation is "how things generally are", and Planner then has the freedom to diverge from the rotation at a penalty. Once you understand the concept of the rotation from the UI, take a look at the planning entity Shift and how the rotation is implemented with the employee and rotationEmployee variables. Note that only employee is an actual @PlanningVariable; rotationEmployee is fixed.
That means you have to define your rotations manually, doing the work of the solver yourself. However, since this operation is presumably only done once a semester, maybe the solution is to have a simpler solver generate a reasonable general rotation first, and then have a second solver take it and figure out the specific necessary adjustments?
With regard to your second problem, rotations could help there too. But I'm also thinking some move filtering and custom moves could help OptaPlanner either move all language classes, or none. Writing efficient custom moves is not easy, and filtering stock moves is cumbersome, so I would only do it once the potential of other options is exhausted. If you end up doing this, look at MoveIteratorFactory.
My answer is a little vague, as we do not get into the specifics of the domain model, but for the purposes of designing the overall solution, it hopefully gives enough clues.

Extract words from review comments using R programming

Greetings!!
I am supposed to extract comments from a gaming site and then find out what exactly people liked and disliked about the game.
I have achieved the first part: extracting the comments from the web pages and storing them in a data frame.
Now I have columns like "Liked" and "Disliked" in my data frame. I want to fetch the specific words in the Liked and Disliked columns.
For example:
Liked
"I like their website, it looks great in my opinion and I am feeling very good when the design attracts me in this way!So I signed up for an account, it took me only a couple of minutes and then I decided to make my first deposit here and to try my luck with Microgaming slots that are my favorite although sometimes I am losing serious amounts of money. Because they have a decent welcome bonus, I made a deposit of 25 euro via Skrill and I received 25 euro bonus. I want to say that this casino is very good in my opinion even if it’s first time when I am playing here.The welcome bonus impressed me, I will give a 10 because the wagering requirements are more than decent. Regarding their games, I have nothing bad to say because they have a lot of slots from different providers so I will give a 9. I recommend you this casino because it’s safe to play, it has lot of games and good welcome bonus!"
Disliked
"I wasn’t able to see any chat option, this would be the only bad thing!"
So from the Liked comment I want phrases like: good design, decent welcome bonus, safe to play.
And for Disliked: No chat option
Can this be achieved? Request you to kindly help me with this. Any help would be highly appreciated.
Thanks and regards,
Ani
Here is how you can do this.
You can select a set of marker words (such as "like", "dislike", "hate", "love") that best express emotion and apply the code below.
library(dplyr)

# toy data: one comment per row, already labelled by group
z <- data.frame(group = c("liked", "disliked", "liked"),
                comment = c("I love this game", "I hate this game", "I like the game"))

# count distinct comments containing positive/negative marker words per group
results <- z %>%
  group_by(group) %>%
  summarise(positive_feedbacks = length(unique(comment[grepl("love|like", comment)])),
            negative_feedbacks = length(unique(comment[grepl("hate", comment)])))
This way you can count the number of positive and negative feedbacks to start with.
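If you also want the matching phrases themselves (closer to "good design" / "decent welcome bonus"), one rough sketch is to match against a curated phrase list; the phrases and sample text below are illustrative assumptions, not derived from your data:
# hypothetical phrase list; extend it with the expressions you care about
phrases <- c("good design", "decent welcome bonus", "safe to play", "chat option")
pattern <- paste(phrases, collapse = "|")
# pull every occurrence of a listed phrase out of a comment
liked_text <- "they have a decent welcome bonus and it is safe to play"
unique(regmatches(liked_text, gregexpr(pattern, liked_text))[[1]])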

Split block of text into separate parts

New to R, so apologies if this is obvious.
Given a text document containing a sample block of text such as the following:
Deputy Kermit: Sir, providing access to good education for all the
Utoppia's children is one of our most important responsibilities as
States’ Members. We all recognise that. On the morning we began to
debate the future of selection in secondary education that was why
feelings ran so high, and why it was so closely fought.
But our responsibility does not stop at the doors of this Assembly.
For the sake of practicality we delegate day-to-day policy
responsibilities to individual Committees. As Deputy Fozzy has rightly
said, the Committee is the agent of the States. Ultimately it should
do what it is told. So there should be no doubt that the buck stops
with us, the States, to be sure that our agent, the Committee, has the
skills, strength and experience necessary for the task we have
assigned to it. If the Committee is not the right one for the task
ahead, especially if it is a task of vital importance to our Island,
then it is our duty to deal with that. We must remember that there is
no hierarchy here, no power to hire or fire discreetly in this
Assembly. If a Committee is in the wrong job but it does not step
down, the only tool we have to manage that is a motion of no
confidence.
Deputy Fozzy’s record too is similar, he just said that change is a
recipe for disaster. On the steps of the States after December’s
debate he told us that Utopia would rue –
The Bailiff: Deputy Fozzy.
Deputy Fozzy: That was never said on the steps of this Assembly
after the debate. I have said nothing ever like that after the debate.
I think you need to check your facts.
The Bailiff: Through the Chair,
Deputy Kermit: I repeat what I heard in the media, sir.
I would like to split each speaker's statements out into their own separate file. What are my options for doing this, given that the speaker's title (in this example Deputy or Bailiff) and the character ':' may also occur within the block of text?
Not sure about the sentence breaks here...just an attempt.
Regex:
(^|[\W\S]\s*)(([A-Z][a-z]+\s?)+:)
Replacement:
$1\n\n$2
Output:
Deputy Kermit: Sir, providing access to good education for all the Utoppia's children is one of our most important responsibilities as States’ Members. We all recognise that. On the morning we began to debate the future of selection in secondary education that was why feelings ran so high, and why it was so closely fought.
But our responsibility does not stop at the doors of this Assembly. For the sake of practicality we delegate day-to-day policy responsibilities to individual Committees. As Deputy Fozzy has rightly said, the Committee is the agent of the States. Ultimately it should do what it is told. So there should be no doubt that the buck stops with us, the States, to be sure that our agent, the Committee, has the skills, strength and experience necessary for the task we have assigned to it. If the Committee is not the right one for the task ahead, especially if it is a task of vital importance to our Island, then it is our duty to deal with that. We must remember that there is no hierarchy here, no power to hire or fire discreetly in this Assembly. If a Committee is in the wrong job but it does not step down, the only tool we have to manage that is a motion of no confidence.
Deputy Fozzy’s record too is similar, he just said that change is a recipe for disaster. On the steps of the States after December’s debate he told us that Utopia would rue –
The Bailiff: Deputy Fozzy.
Deputy Fozzy: That was never said on the steps of this Assembly after the debate. I have said nothing ever like that after the debate. I think you need to check your facts.
The Bailiff: Through the Chair,
Deputy Kermit: I repeat what I heard in the media, sir.
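If you want to do the splitting and file-writing in R as well, here is a rough sketch. The heading pattern (one or more capitalised words, optionally preceded by "The", followed by a colon at the start of a line) and the file name "debate.txt" are assumptions you may need to adjust for your data:
# read the whole document as one string
txt <- paste(readLines("debate.txt"), collapse = "\n")
# find the start of each "Title Name:" heading ((?m) makes ^ match per line)
starts <- gregexpr("(?m)^(?:The )?(?:[A-Z][a-z]+ ?)+:", txt, perl = TRUE)[[1]]
# cut the text at each heading and write one numbered file per statement
bounds <- c(starts, nchar(txt) + 1)
statements <- substring(txt, bounds[-length(bounds)], bounds[-1] - 1)
for (i in seq_along(statements)) {
  writeLines(trimws(statements[i]), sprintf("statement_%02d.txt", i))
}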

Bypass Style Formatting when Parsing RSS Feed in R

I am trying to scrape and parse the following RSS feed: http://www.nestle.com/_handlers/rss.ashx?q=068f9d6282034061936dbe150c72d197. I have no problem extracting the basic items I need (e.g., title, description, pubDate) using the following code:
library(RCurl)
library(XML)
xml.url <- "http://www.nestle.com/_handlers/rss.ashx?q=068f9d6282034061936dbe150c72d197"
script <- getURL(xml.url)
doc <- xmlParse(script)
titles <- xpathSApply(doc,'//item/title',xmlValue)
descriptions <- xpathSApply(doc,'//item/description',xmlValue)
pubdates <- xpathSApply(doc,'//item/pubDate',xmlValue)
My problem is that the output for item "description" includes not only the actual text but also a lot of style formatting expressions. For example, the first element is:
descriptions[1]
[1] "<p><iframe height=\"322\" src=\"https://www.youtube-nocookie.com/embed/fhESDXnlMa0?rel=0\" frameBorder=\"0\" width=\"572\"></iframe><br />\n<br />\n<p><em>Nescafé</em> is partnering with Facebook to launch an immersive video, pioneering new technology just released for the platform.</p>\n<p>\nThe <em>Nescafé</em> <a class=\"externalLink\" title=\"Opens in a new window: Nescafé on Facebook\" href=\"https://www.facebook.com/Nescafe/videos/vb.203900255471/10156233581755472/?type=2&theater\" target=\"_blank\">‘Good Morning World’ video</a> stars people in kitchens across the world, performing the hit song ‘Don’t Worry’ using spoons, cups, forks and a jar of coffee. Uniquely, viewers can rotate their smartphones through 360˚ to explore the video, the first time this has been possible on Facebook.</p>\n<p>\n“We know young coffee lovers pick up their phone at the start of every day looking to be entertained by real experiences. The 360˚ video allows us to be engaging in an innovative way,” said Carsten Fredholm, Senior Vice President of Nestlé’s Beverage Strategic Business Unit.\n</p>\n<p><em>Nescafé</em> recently teamed up with Google to offer the first virtual reality coffee experience through the <em>Nescafé 360˚</em> app. It also became the first global brand to move its website onto Tumblr, to strengthen connections with younger fans by allowing them to create and share content.</p>\n<p>The Nestlé brand is one of only six globally to partner Facebook for the launch of this technology.</p></p>"
I can think of a regex approach to replace the unwanted character strings. However, is there a way to access the plain text elements of item "description" directly through xpath?
Any help with this issue is very much appreciated. Thank you.
You can do:
descriptions <- sapply(descriptions, function(x) {
  # parse the embedded HTML fragment, then keep only its text content
  xmlValue(xmlRoot(htmlParse(x)))
}, USE.NAMES = FALSE)
which gives (via cat(stringr::str_wrap(descriptions[[1]], 70))):
In a move that will provide young Europeans increased access to
jobs and training opportunities, Nestlé and the Alliance for YOUth
have joined the European Pact for Youth as founding members. Seven
million people in Europe under the age of 25 are still inactive -
neither in employment, education or training. The European Pact for
Youth, created by European CSR business network CSR Europe and the
European Commission, aims to work together with businesses, youth
organisations, education providers and other stakeholders to reduce
skills gaps and increase youth employability. As part of the Pact, the
Alliance for YOUth will focus on setting up “dual learning” schemes
across Europe, combining formal education with apprenticeships and on-
the-job training to help match skills with jobs on the market. The
Alliance for YOUth is a group of almost 200 companies mobilised by
Nestlé to help young people in Europe find work. It has pledged to
create 100,000 employability opportunities by 2017 and has already met
half of this target in its first year. Luis Cantarell, Executive Vice
President for Nestlé and co-initiator of the European Pact for Youth,
said: “Promoting a cultural shift to dual learning schemes based on
business-education collaboration is at the heart of Nestlé’s youth
employment initiative since its start in 2013. The European Pact for
Youth will help to build a skilled workforce and will tackle youth
unemployment.” Learn more about the European Pact for Youth and read
their press release.
There are \n characters at various points in the resultant text (in almost all the descriptions) but you can gsub those away.
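For example, a one-line cleanup along those lines, collapsing every newline to a single space:
descriptions <- gsub("\n", " ", descriptions, fixed = TRUE)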

Looping through website links with R

I am looking to incorporate a loop in R which goes through every game's boxscore data on the NFL statistics website here: http://www.pro-football-reference.com/years/2012/games.htm
At the moment I have to manually click on the "boxscore" link for every game every week; is there any way to automate this in R? My code works with the full play-by-play dataset within each link, but doing this by hand is taking ages!
Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable.
require(RCurl)
require(XML)

# fetch and parse the season schedule page
bdata <- getURL('http://www.pro-football-reference.com/years/2012/games.htm')
bdata <- htmlParse(bdata)

# collect the attributes (the href) of every link whose URL contains
# "boxscore"; [-1] drops the first match
boxdata <- xpathSApply(bdata, '//a[contains(@href,"boxscore")]', xmlAttrs)[-1]
The above will get the boxscore stem for the various games.
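From there, a minimal sketch of the loop, assuming each entry in boxdata is a relative path like "/boxscores/..." (check the actual values first); the play-by-play extraction itself is left as a placeholder:
base <- 'http://www.pro-football-reference.com'
games <- lapply(boxdata, function(stem) {
  # fetch and parse one game page; replace the return value with your
  # play-by-play extraction
  htmlParse(getURL(paste0(base, stem)))
})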
