How to count the frequency of unique words in a column? - r

I need to count the frequency of unique words found in a column that contains descriptions in each row.
So far, I have eliminated a list of stopwords from the original column and I have extracted the unique words from the column and put them into a list called unique_description.
> description[1:5]
[1] "Come stay Vinh & Stuart (Awarded one Australia's top hosts Airbnb CEO Brian Chesky & key shareholder Ashton Kutcher. We're Sydney's #1 reviewed hosts ). Find out 've positively reviewed 500+ times. Message talk first BEFORE make reservation request - And please read listing end (hint hint). Everything need know . We're pretty relaxed hosts, fully appreciate staying someone , home home, -one. This business, hotel. We're casual Airbnb hosts, hoteliers. If 're looking alternative expensive hotel, 're . Here 'll treated same way treat family & friends stay. So... fluffy bathrobes... Please hello message *BEFORE* make reservation request... It'll help speed things up, smooth things out... Please read listing way end. It make getting confirmed reserv"
[2] "Beautifully renovated, spacious quiet, 3 Bedroom, 3 Bathroom home 10 minute walk beaches Fairlight Forty Baskets, 30 minute walk Manly via coastal promenade, Express bus runs 20 mins door. Our home thirty minute walk along seashore promenade Manly, one Sydney's beautiful beaches, village restaurants, cafes, shopping. If prefer more variety, Manly ferry take Sydney CBD 15 minutes. The residence sited sought- family-friendly street short stroll nearby North Harbour reserve Forty Baskets cafe beach. It's short walk further express CBD buses, ferries, Manly entertainment. Or bus (#131 #132) around corner drops Manly 8 minutes. Our home features stainless steel galley kitchen, including Ilve oven gas cooktop. We two separate living areas ground floor. The front lounge enjoys P&O"
[3] "Welcome sanctuary - bright, comfortable one bedroom apartment North Sydney. Free Wifi, heated pool/jacuzzi everything need make stay Sydney very comfortable. Enjoy fabulous Home away home, fantastic stay Sydney! The apartment within walking distance restaurants shops, Luna Park North Sydney business district. Access Sydney CBD easy bus, train, taxi ferry. It short bus ride famous Balmoral Beach Taronga Zoo. My apartment situated North Sydney 3 kms Sydney CBD. Here details apartment: You'll enjoy being centrally located couple blocks away train station go anywhere quickly Sydney. The apartment features several windows tons natural light. It comfortable fully stocked. Here's I here: LIVING ROOM: 50\" LCD TV DVD / blu ray player CD/Radio/Blue tooth syncing w"
[4] "Fully self-contained sunny studio apartment. 10mn walk Bondi beach. Bus city door. Private 13m swimming pool. Sunny, studio apartment . Private terrace. bus door Bondi Junction City Ground floor 1 bedroom double bed plus kitchenette & study desk. shower & toilet, share laundry, kitchen facilities Swimming pool 13m. Separate security private entrance Private entrance. Ground floor. Happy indicate best spots walking, dining, entertaining best sightseeing location Sydney. Upmarket area. Very nice quiet neighbourhood . Very safe place. Bus door city."
[5] "Sunny warehouse/loft apartment heart one Sydney's best neighbourhoods. Located corner two iconic Darlinghurst streets famous laneway bars eateries, footsteps equally amazing Surry Hills Potts Point. Walk through beautiful parks city less 10 mins, opera house 20 access Bondi Beach easily 25 via bus stop directly front building. My apartment beautiful, simple, open plan / one bedroom loft soaring high ceilings hardwood floors hint 's previous life printing factory 1940s. It huge windows flood space glorious sunshine throughout day provide refreshing breeze during summer. A few key features: * Wireless harman/kardon aura stereo system stream music wirelessly bluetooth device * Internal laundry washer dryer * The kitchen equipped gas cooking, microwave, dishwasher basics preparing m"
> unique_description[1:10]
[1] "Come" "stay" "Vinh" "&" "Stuart" "(Awarded"
[7] "one" "Australia's" "top" "hosts"
I'm not sure how to count how often each word in unique_description occurs in the column 'description'. I tried using freq_terms from library(qdap), but qdap will not load for me, so I am trying to find another way.

You could use the stringr package.
library(stringr)
x <- "Come stay Vinh & Stuart (Awarded one Australia's top hosts Airbnb CEO Brian Chesky & key shareholder Ashton Kutcher. We're Sydney's #1 reviewed hosts ). Find out 've positively reviewed 500+ times. Message talk first BEFORE make reservation request - And please read listing end (hint hint). Everything need know . We're pretty relaxed hosts, fully appreciate staying someone , home home, -one. This business, hotel. We're casual Airbnb hosts, hoteliers. If 're looking alternative expensive hotel, 're . Here 'll treated same way treat family & friends stay. So... fluffy bathrobes... Please hello message *BEFORE* make reservation request... It'll help speed things up, smooth things out... Please read listing way end. It make getting confirmed reserv"
y <- "come stay Stuart"
unique_desc <- c("come", "stay", "Stuart")
desc <- c(x,y)
result <- lapply(desc, FUN = str_count, pattern = unique_desc)
# result[[i]] holds the counts of each word of unique_desc in desc[i]
lapply calls str_count on each element of desc. The ith entry of result is a vector of counts for the ith entry of desc, with one count per word in unique_desc. Note that str_count is case-sensitive and matches substrings, so "come" does not match "Come" and "stay" also matches inside "staying".
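If stringr is not available either, the same kind of count can be sketched in base R by splitting the text into words and tabulating them. This is only a minimal sketch under assumptions: the description vector below is made-up sample data, and lower-casing plus splitting on anything that is not a letter or apostrophe is one possible tokenisation, not necessarily the one used to build unique_description.

```r
# Hypothetical sample data standing in for the 'description' column
description <- c("Come stay with us. Stay close to the beach.",
                 "come for the beach, stay for the food")

# Lower-case, then split on anything that is not a letter or an apostrophe
words <- unlist(strsplit(tolower(description), "[^a-z']+"))
words <- words[nzchar(words)]  # drop empty strings, if any

# Words whose frequency we want (assumed to be lower-case already)
unique_description <- c("come", "stay", "beach")

# Using a factor keeps words with zero occurrences in the result
counts <- table(factor(words, levels = unique_description))
print(counts)
```

Unlike the str_count approach, this counts whole words only, so "stay" is not counted inside "staying".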

Related

Bayesian networks for text analysis in R

I have a one-page story (i.e. text data), and I need to apply a Bayesian network to that story and analyse it. Could someone tell me whether this is possible in R? If yes, how should I proceed?
The objective of the analysis is to extract action descriptions from narrative text.
The data considered for analysis:
Krishna’s Dharam-shasthra to Arjuna:
The Gita is the conversation between Krishna and Arjuna leading up to the battle.
Krishna emphasised on two terms: Karma and Dharma. He told Arjun that this was a righteous war; a war of Dharma. Dharma is the way of righteousness or a set of rules and laws laid down. The Kauravas were on the side of Adharma and had broken rules and laws and hence Arjun would have to do his Karma to uphold Dharma.
Arjuna doesn't want to fight. He doesn't understand why he has to shed his family's blood for a kingdom that he doesn't even necessarily want. In his eyes, killing his evil and killing his family is the greatest sin of all. He casts down his weapons and tells Krishna he will not fight. Krishna, then, begins the systematic process of explaining why it is Arjuna's dharmic duty to fight and how he must fight in order to restore his karma.
Krishna first explains the samsaric cycle of birth and death. He says there is no true death of the soul simply a sloughing of the body at the end of each round of birth and death. The purpose of this cycle is to allow a person to work off their karma, accumulated through lifetimes of action. If a person completes action selflessly, in service to God, then they can work off their karma, eventually leading to a dissolution of the soul, the achievement of enlightenment and vijnana, and an end to the samsaric cycle. If they act selfishly, then they keep accumulating debt, putting them further and further into karmic debt.
What I want is a POS (part-of-speech) tagger to separate verbs, nouns, etc., and then to create a meaningful network from the result.
The pre-processing steps that should be followed are:
syntactic processing (POS tagger)
SRL algorithm (semantic role labelling of the characters of the story)
coreference resolution
Using all of the above, I need to create a knowledge database and then a Bayesian network.
This is what I have tried so far to get a POS tagger:
txt <- c("As the years went by, they remained isolated in their city. Their numbers increased by freeing women from slavery.
Doom would come to the world in the form of Ares the god of war and the Son of Zeus. Ares was unhappy with the gods as he wanted to prove just how foul his father’s creation was. Hence, he decided to corrupt the mortal men created by Zeus. Fearing his wrath upon the world Zeus decided to create the God killer in order to stop Ares. He then commanded Hippolyta to mould a baby from the sand and clay of the island. Then the five goddesses went back into the Underworld, drawing out the last soul that remained in the Well and giving it incredible powers. The soul was merged with the clay and became flesh. Hippolyta had her daughter and named her Diana, Princess of the Amazons, the first child born on Paradise Island.
Each of the six members of the Greek Pantheon granted Diana a gift: Demeter, great strength; Athena, wisdom and courage; Artemis, a hunter's heart and a communion with animals; Aphrodite, beauty and a loving heart; Hestia, sisterhood with fire; Hermes, speed and the power of flight. Diana was also gifted with a sword, the Lasso of truth and the bracelets of penance as weapons to defeat Ares.
The time arrived when Diana, protector of the Amazons and mankind was sent to the Man's World to defeat Ares and rid the mortal men off his corruption. Diana believed that only love could truly rid the world of his influence. Diana was successfully able to complete the task she was sent out by defeating Ares and saving the world.
")
writeLines(txt, tf <- tempfile())
library(stringi)
library(cleanNLP)
cnlp_init_tokenizers()
anno <- cnlp_annotate(tf)
names(anno)
get_token(anno)
cnlp_init_spacy()
anno <- cnlp_annotate(tf)
get_token(anno)
cnlp_init_corenlp()

fread EOF instead of separator

I'm trying to read a huge file with fread, but I guess something is wrong with the layout of the file.
If I try to read the file with
data = fread(input = "../data.txt", sep = "\t")
on this file (I just took the line with the error and a few lines before and after):
ID imdbID Title Year Rating Runtime Genre Released Director Writer Cast Metacritic imdbRating imdbVotes Poster Plot FullPlot Language Country Awards lastUpdated Type
683 tt0000683 The Fatal Hour 1908 14 min Short, Crime 1908-08-18 D.W. Griffith D.W. Griffith George Gebhardt, Harry Solter, Linda Arvidson, Florence Auer 5.9 26 Pong Lee, a Mephistophelian, saffron-skinned varlet, has for some time carried on this atrocious female white slave traffic, in which sinister business he was assisted by a stygian whelp, ... Pong Lee, a Mephistophelian, saffron-skinned varlet, has for some time carried on this atrocious female white slave traffic, in which sinister business he was assisted by a stygian whelp, by name Hendricks. Pong writes Hendricks that he has need for five young girls, and so Hendricks sets out to secure them. Visiting a rural district, he has no trouble, by his glib, affable manner, in gaining the confidence of several young and pretty girls. Pong is on hand with a closed carriage to bag the prey. One of the girls, as she is seized, emits a yell that alarms the neighborhood and brings to the scene several policemen and a couple of detectives, who have long been on the lookout for these caitiffs. The Chinese get away with the carriage, however, and Hendricks by subterfuge throws the police on the wrong scent. One of the detectives is a woman, and possessed of shrewd powers of deduction, hence does not swallow the bald story of the villain, and exercises her natural acumen with success. She shadows Hendricks, and by means of a flirtation inveigles him to a restaurant, where she succeeds in doping his drink. He falls asleep and she secures the letter written by Pong, which discloses the hiding place of the Chinaman. This she immediately telephones to the police, and while so doing Hendricks awakes and starts off to warn his friends. 
He arrives at the old deserted house ahead of the police, but escape is impossible, so the police rescue the girls, but fail to secure Pong and Hendricks, who afterwards seize the girl detective, and taking her to the house, tie her to a post and arrange a large pistol on the face of a clock in such a way that when the hands point to twelve the gun is fired and the girl will receive the charge. Twenty minutes are allowed for them to get away, for the hands are now indicating 11:40. Certain death seems to be her fate, and would have been had not an accident disclosed her plight. Hendricks after leaving the place is thrown by a street car, and this serves to discover his identity, so he is captured and a wild ride is made to the house in which the poor girl is incarcerated. This incident is shown in alternate scenes. There is the helpless girl, with the clock ticking its way towards her destruction, and out on the road is the carriage, tearing along at breakneck speed to the rescue, arriving just in time to get her safely out of range of the pistol as it goes off. In conclusion we can promise this to be an exceedingly thrilling film, of more than ordinary interest. English USA 2015-10-24 01:44:09.623000000 movie
684 tt0000684 Father Gets in the Game 1908 10 min Short, Comedy 1908-10-10 D.W. Griffith D.W. Griffith Mack Sennett, Harry Solter, George Gebhardt, Linda Arvidson 5.1 39 "You have got to keep up with the bandwagon or quit." This never impressed old Wilkins so forcibly as when his son and daughter give him the go-by, stamping him as a "has-been," and away ... "You have got to keep up with the bandwagon or quit." This never impressed old Wilkins so forcibly as when his son and daughter give him the go-by, stamping him as a "has-been," and away out of the game. Even Mrs. Wilkins, who is as vivacious as a widow, snubs him. He keenly feels his condition and resolves to alter it. With this in view, he enlists the services of Professor Dyem, the celebrated Dermatologist and Tonsorial Artist. After a session with the Professor, beheld the transmogrified Wilkins. What a change! Shorn of his grizzly beard, his locks raven, complexion florid, eye clear and step elastic, he views himself in the mirror. He hardly recognizes himself. In fact, it requires his valet to convince him that he is he. "Am I in it? Well. I guess. If I don't keep up with and even beat that bandwagon by a city block, my name is not Pill Wilkins." He sallies forth and makes for the park. The first person he encounters is his wife. He approaches her in elation, but she mistakes him for an impudent masher and he receives the weight of her parasol over his head for his trouble. The next one he meets is his daughter. She is seated on a bench, waiting for Charley. He takes a seat beside her and when he tries to make himself known she draws herself up to full height and with a blow sends him backward over the bench onto the grass. Well, he changes his tactics, and gets reckless. Along comes his son with his best girl, so he decides to win her out for spite. 
Now this young lady has a sensitive pneumogastric nerve, and when he sits beside her on the bench and slyly suggests a cold bottle and a hot bird, she is "his'n." This is done so coolly and so quickly, that young Wilkins, who, of course, does not recognize his respected papa, is speechless with rage. He follows them, however, to the café, where his intrusion is resented and he is rudely thrown from the place. At the Wilkins' domicile there is an indignation meeting. Mother, daughter and son all rush in to relate their experiences to father. He is not to be found. Suddenly a hilarious individual enters. "'Tis he, the insulter: a drunk and disorderly." They are about to have him thrown out when the valet comes to his rescue and explains that the jubilant gentleman is no other than their dear papa, who has not only caught up with the bandwagon, but is sitting on the seat with the driver. They all gasp in surprise, and young Wilkins takes a wreath of laurel from a statue and places it on old Wilkins' brow, saying: "Pop, you are the candy." English USA 2015-10-02 04:59:48.643000000 movie
685 tt0000685 The Feud and the Turkey 1908 15 min Short, Drama, Romance 1908-12-08 D.W. Griffith D.W. Griffith Harry Solter, Linda Arvidson, Arthur V. Johnson, Robert Harron 5.8 13 The Wilkinsons and Caulfields, owing to a trivial dispute, had been at loggerheads for years and as time went on the feeling became more bitter, until they even forbade their children ... The Wilkinsons and Caulfields, owing to a trivial dispute, had been at loggerheads for years and as time went on the feeling became more bitter, until they even forbade their children playing together. The little ones, however, in their childish innocence, could not appreciate the odium of their elders, and Bobby Wilkinson and Nellie Caulfield became child lovers. This incensed Colonel Wilkinson, who tore the children apart, ordered Bobby never to be seen in her company again. The Colonel's action ignited the ire of the Caulfields and a furious conflict ensued, resulting in the shooting to death of George, the Colonel's youngest son, a boy of fourteen. From that time on the clans kept strictly to themselves. But love knows no clannishness, and, despite family hatred, Bob and Nellie remained lovers. After ten years, driven to desperation by this apparently insurmountable barrier, they elope and are married. Bob decides to brave the storm of his father's anger and present his wife, but the old Colonel drives him from the house, disowning him. Old Aunt Dinah and Uncle Daniel, the colored servants, were so attached to the young folks that they go with them. Two years later we find the little family, now increased by an infant son, having a hard of it. It is Christmas morning and no turkey for dinner. Old Aunt Dinah, believing in the efficacy of prayer, gets down on her knees in the kitchen to ask the good Lord to send them a bird. Uncle Daniel, touched by this demonstration of faith, takes a gun and determines to get a turkey at any hazard. 
Over the hills he goes, but his journey is hopelessly fruitless until he comes to the rear of the Colonel's house. Tillie, the cook, has just hung a fat turkey on a post outside the kitchen door. When Daniel sees it he can't resist the temptation. Back home he hustles and finds Dinah still at prayer, he lays the fowl on the floor beside her and sneaks out. When Dinah sees it she surely thinks it was due to her prayers. Well, the turkey is cooked and an old-fashioned Christmas anticipated. Meanwhile the Colonel has discovered his loss and tracks the thief to Bob's estate. Entering, a tragedy seems inevitable, but when the old Colonel sees the young one, his grandson, in the cradle, his heart goes out to it and the feud ends then and there. All hands sit down and enjoy a real Merry Christmas dinner. English USA 2015-08-29 00:33:15.610000000 movie
686 tt0000686 Fiestas del carnaval de 1908 en Barcelona 1908 Documentary, Short Fructuós Gelabert Fructuós Gelabert Spain 2015-11-09 14:24:29.583000000 movie
I get this error:
> Error in fread(input = "../data.txt", sep="\t" : Expected sep (' ') but new line, EOF (or other
> non printing character) ends field 20 when detecting types ( first):
> 684 tt0000684 Father Gets in the Game 1908 10 min Short,
> Comedy 1908-10-10 D.W. Griffith D.W. Griffith Mack Sennett, Harry
> Solter, George Gebhardt, Linda Arvidson 5.1 39 "You have got to keep
> up with the bandwagon or quit." This never impressed old Wilkins so
> forcibly as when his son and daughter give him the go-by, stamping him
> as a "has-been," and away ... "You have got to keep up with the
> bandwagon or quit." This never impressed old Wilkins so forcibly as
> when his son and daughter give him the go-by, stamping him as a
> "has-been," and away out of the game. Even Mrs. Wilkins, who is as
> vivacious as a widow, snubs him. He keenly feels his condition and
> resolves to alter it. With this in view, he enlists the services of
> Professor Dyem, the celebrated Dermatologist and Tonsorial Artist.
> After a session with the Professor, beheld the transmogrified Wilkins.
> W
How can I solve it?
I'm not 100% sure what the error in your data is, but try adding fill = TRUE to the fread call:
data = fread(input = "../data.txt", sep = "\t", fill = TRUE)
I had a similar error, and it seemed that fread was having trouble identifying my column separation. Setting fill to TRUE allows fread to fill in any missing fields; at least then you can check the resulting data frame and find out where the weirdness is.
Add fill = TRUE to the fread call.
What's happening: the rows in the data have unequal lengths. With fill = TRUE, the missing fields are implicitly filled with blanks.
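To find out which lines have the wrong number of fields, one option is base R's count.fields, which reports the number of separated fields on each line. The sketch below uses a small made-up file (the column names and contents are hypothetical) in which one record is broken over two lines, similar to the long plot text in the question.

```r
# Hypothetical tab-separated file: record 2 is split across two lines
tf <- tempfile(fileext = ".txt")
writeLines(c("ID\tTitle\tType",
             "1\tFirst\tmovie",
             "2\tSecond plot broken",
             "over two lines\tmovie",
             "3\tThird\tmovie"), tf)

# Number of tab-separated fields per line; quote = "" treats quotes as plain text
n_fields <- count.fields(tf, sep = "\t", quote = "")

# Lines whose field count differs from the header line
bad_lines <- which(n_fields != n_fields[1])
print(bad_lines)
```

Inspecting the reported lines in the real file should show whether the problem is a stray line break, a missing separator, or an unbalanced quote.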

How to import angle brackets <></> data into R

Can anyone help me import data with angle brackets into R from a Unix executable file? It looks like XML, so I tried to use an XML parser, but it failed.
I have attached a sample file.
Thanks in advance.
https://drive.google.com/file/d/0B97ow4h4jwHcRTVtWHdudDJ0c1k/view?usp=sharing
Your XML document contains unescaped '&' characters inside its elements.
One example is below:
<DOC>
<DATE>01/07/2009</DATE>
<AUTHOR>Debce</AUTHOR>
<TEXT>I have owned my MDX for about 1 1/2 yrs & have loved every minute of driving the 24k problem free miles on it! It is so much fun to drive; looks & feels luxurious so no problem pulling up to upscale places! I didn't want to give up space to pop things in the back and go so I keep the third seat down & purchased the rubber mat for the back. I have plenty of room while at the same time I am "zippy"; easily pulling into parking spaces and getting around town. I love the navigation system, although it does need updating and the bluetooth is wonderful, although for some reason it keeps unhooking my Treo phone which the Acura people say is the phone's fault. LOVE IT & would buy it again.</TEXT>
<FAVORITE>Large storage area, hands free phone with the bluetooth & voice recognition is safe. The heaviness of it feels safe and large interior is very comfortable. </FAVORITE>
</DOC>
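'&' characters should be escaped as '&amp;'.
In XML text, '&' and '<' are special characters that must always be escaped ('&amp;' and '&lt;'), and '>' is usually escaped as '&gt;' as well. '%' does not need to be escaped in element content.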
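One way to repair such a file before handing it to a strict XML parser is to escape every bare '&' with a regular expression. This is a minimal base-R sketch, assuming the only problem is unescaped ampersands; the sample string is made up, and the pattern leaves existing entities such as '&amp;' or '&lt;' untouched.

```r
# Hypothetical fragment with one bare '&' and two entities that are already valid
raw <- "<TEXT>looks & feels luxurious, 1 &lt; 2 &amp; more</TEXT>"

# Match '&' only when it does NOT start a named or numeric entity
pattern <- "&(?!(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)"
escaped <- gsub(pattern, "&amp;", raw, perl = TRUE)
print(escaped)
```

For a whole file, the same gsub call could be applied to the result of readLines before parsing.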
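Here is a way of extracting the data into a character matrix. htmlParse uses the lenient HTML parser, which tolerates the unescaped '&' characters and lowercases the tag names, which is why the XPath below is "//doc" rather than "//DOC".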
> require(XML)
> x <- htmlParse("/temp/2007_acura_mdx")
>
> # get the 'DOC'
> docs <- getNodeSet(x, "//doc")
>
> # display one
> docs[[1]]
<doc>
<date>07/31/2009</date>
<author>FlewByU</author>
<text>I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. </text>
<favorite>The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. </favorite>
</doc>
>
> # process docs getting all fields -- need to transpose
> results <- t(sapply(docs, function(x) xmlSApply(x, xmlValue)))
>
> # show head
> head(results)
date author
[1,] "07/31/2009" "FlewByU"
[2,] "07/30/2009" "cvillemdx"
[3,] "06/22/2009" "Pleased"
[4,] "04/13/2009" "wasatch7"
[5,] "04/06/2009" "mnozek"
[6,] "01/07/2009" "Debce"
text
[1,] "I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. "
[2,] "After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I love the way the car handles, no stiffness or resistance in the steering or acceleration. The interior design is a little Star Trek for me, but once I figured everything out, it is a pleasure to have all the extras (XM radio, navigation, Bluetooth, backup camera, etc.)"
[3,] "I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT'S IT. Everything else is perfect. Great performance, plenty of power and AWD when skiing, plenty of room for baggage, great MPG for an SUV, navi system is far superior to GM's Suburban (don't have to put in park to change your destination, etc). Zero problems...just gas and oil changes. One beautiful car...except for the sho-gun shield looking grill."
[4,] "First luxury crossover SUV I have owned. MDX won out over the Lexus, and cost less for a very well equipped base package. Handling, power and ride are outstanding. Back seats are a little less comfortable for my tall teenagers. Back cargo area is very roomy, and easily expandable with 3rd seat folded and back seats down. I drive up snowy, often treacherous mountain canyons to ski in the winter. The SH-AWD system, coupled with the manual shift mode (for descents), is outstanding. The MDX is much better in the snow than 3 truck base SUVs, I have owned previously. "
[5,] "This is the first Japanese SUV we have had in a while. Last SUV's were Yukon XL and Envoy XL. This beats them out by far. Performs almost as well as our Mercedes e class but has the utility of our Envoy. We always take this on trips and it is very comfortable. The third row is great for smaller children but not so much for adults. Best SUV so far. No problems within our almost 2 years ownership."
[6,] "I have owned my MDX for about 1 1/2 yrs & have loved every minute of driving the 24k problem free miles on it! It is so much fun to drive; looks & feels luxurious so no problem pulling up to upscale places! I didn't want to give up space to pop things in the back and go so I keep the third seat down & purchased the rubber mat for the back. I have plenty of room while at the same time I am \"zippy\"; easily pulling into parking spaces and getting around town. I love the navigation system, although it does need updating and the bluetooth is wonderful, although for some reason it keeps unhooking my Treo phone which the Acura people say is the phone's fault. LOVE IT & would buy it again."
favorite
[1,] "The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. "
[2,] "The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking spaces and parallel parking a breeze, along with the back-up camera. Also a fan of the push-to-talk for my cell phone."
[3,] "Navi is easy, hands-free is great, AWD is perfect."
[4,] "AWD system, exterior styling, cargo room"
[5,] "Navigation, sound system, bluetooth, comfort, acceleration, performance, all wheel drive ability."
[6,] "Large storage area, hands free phone with the bluetooth & voice recognition is safe. The heaviness of it feels safe and large interior is very comfortable. "
>
>
>
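If a data frame is more convenient than a character matrix, the result can be converted afterwards. The sketch below uses a small made-up stand-in for the results matrix (two rows, two columns) and assumes the dates are in month/day/year order, as they appear in the output above.

```r
# Hypothetical stand-in for the `results` matrix produced above
results <- cbind(date   = c("07/31/2009", "07/30/2009"),
                 author = c("FlewByU", "cvillemdx"))

# Character matrix -> data frame, keeping the text columns as character
df <- as.data.frame(results, stringsAsFactors = FALSE)

# Parse the month/day/year strings into Date objects
df$date <- as.Date(df$date, format = "%m/%d/%Y")
str(df)
```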

Importing a text file including <> angle brackets into R

I have a data file containing angle brackets, from http://kavita-ganesan.com/opinosis-opinion-dataset.
<DOCNO>2007_acura_mdx</DOCNO>
<DOC>
<DATE>07/31/2009</DATE>
<AUTHOR>FlewByU</AUTHOR>
<TEXT>I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. </TEXT>
<FAVORITE>The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. </FAVORITE>
</DOC>
<DOC>
<DATE>07/30/2009</DATE>
<AUTHOR>cvillemdx</AUTHOR>
<TEXT>After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I love the way the car handles, no stiffness or resistance in the steering or acceleration. The interior design is a little Star Trek for me, but once I figured everything out, it is a pleasure to have all the extras (XM radio, navigation, Bluetooth, backup camera, etc.)</TEXT>
<FAVORITE>The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking spaces and parallel parking a breeze, along with the back-up camera. Also a fan of the push-to-talk for my cell phone.</FAVORITE>
</DOC>
<DOC>
<DATE>06/22/2009</DATE>
<AUTHOR>Pleased</AUTHOR>
<TEXT>I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT'S IT. Everything else is perfect. Great performance, plenty of power and AWD when skiing, plenty of room for baggage, great MPG for an SUV, navi system is far superior to GM's Suburban (don't have to put in park to change your destination, etc). Zero problems...just gas and oil changes. One beautiful car...except for the sho-gun shield looking grill.</TEXT>
<FAVORITE>Navi is easy, hands-free is great, AWD is perfect.</FAVORITE>
</DOC>
It looks like an XML file, but when I tried
xml.url <- "2007_acura_mdx"
xmlfile <- xmlTreeParse(xml.url)
class(xmlfile)
xmltop <- xmlRoot(xmlfile)
topxml <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml), row.names=NULL)
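I had a problem when I executed data.frame. Can anyone help me? At the moment I am considering grep() and gsub(), but that is not easy either.
Try this. The file has no single root element (DOCNO and the DOC elements all sit at the top level), so wrap the text in one before parsing: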
txt <- "<DOCNO>2007_acura_mdx</DOCNO>
<DOC>
<DATE>07/31/2009</DATE>
<AUTHOR>FlewByU</AUTHOR>
<TEXT>I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. </TEXT>
<FAVORITE>The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. </FAVORITE>
</DOC>
<DOC>
<DATE>07/30/2009</DATE>
<AUTHOR>cvillemdx</AUTHOR>
<TEXT>After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I love the way the car handles, no stiffness or resistance in the steering or acceleration. The interior design is a little Star Trek for me, but once I figured everything out, it is a pleasure to have all the extras (XM radio, navigation, Bluetooth, backup camera, etc.)</TEXT>
<FAVORITE>The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking spaces and parallel parking a breeze, along with the back-up camera. Also a fan of the push-to-talk for my cell phone.</FAVORITE>
</DOC>
<DOC>
<DATE>06/22/2009</DATE>
<AUTHOR>Pleased</AUTHOR>
<TEXT>I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT'S IT. Everything else is perfect. Great performance, plenty of power and AWD when skiing, plenty of room for baggage, great MPG for an SUV, navi system is far superior to GM's Suburban (don't have to put in park to change your destination, etc). Zero problems...just gas and oil changes. One beautiful car...except for the sho-gun shield looking grill.</TEXT>
<FAVORITE>Navi is easy, hands-free is great, AWD is perfect.</FAVORITE>
</DOC>"
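library(XML)
# Wrap everything in a single root element so the text is well-formed XML
txt2 <- paste("<root>", txt, "</root>")
doc <- xmlTreeParse(txt2, asText = TRUE, useInternalNodes = TRUE)
# For each <DOC> node, collect the text value of every child element
L <- xpathApply(doc, "//DOC", xmlApply, FUN = xmlValue)
# Turn each record into a one-row data frame and stack the rows
dd <- do.call(rbind, lapply(L, as.data.frame, stringsAsFactors = FALSE))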
giving:
> str(dd)
'data.frame': 3 obs. of 4 variables:
$ DATE : chr "07/31/2009" "07/30/2009" "06/22/2009"
$ AUTHOR : chr "FlewByU" "cvillemdx" "Pleased"
$ TEXT : chr "I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We ju"| __truncated__ "After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I lov"| __truncated__ "I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT"| __truncated__
$ FAVORITE: chr "The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound sy"| __truncated__ "The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking sp"| __truncated__ "Navi is easy, hands-free is great, AWD is perfect."
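
Since the question also mentions grep() and gsub(), here is a minimal base R sketch of that route, without the XML package. The helper name extract_tag and the shortened sample_txt are made up for illustration, and the sketch assumes every <DOC> contains each field exactly once (otherwise the columns would have different lengths).

```r
# Hypothetical helper: return the contents of every <TAG>...</TAG> pair in a string
extract_tag <- function(txt, tag) {
  # (?s) lets "." match newlines; ".*?" keeps each match as short as possible
  pattern <- sprintf("(?s)<%s>(.*?)</%s>", tag, tag)
  hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
  # Strip the opening and closing tags, then trim surrounding whitespace
  trimws(gsub(sprintf("</?%s>", tag), "", hits))
}

# Abbreviated sample in the same layout as the file from the question
sample_txt <- "<DOCNO>2007_acura_mdx</DOCNO>
<DOC>
<DATE>07/31/2009</DATE>
<AUTHOR>FlewByU</AUTHOR>
<TEXT>It has everything I could want.</TEXT>
</DOC>
<DOC>
<DATE>07/30/2009</DATE>
<AUTHOR>cvillemdx</AUTHOR>
<TEXT>I settled on the MDX.</TEXT>
</DOC>"

fields <- c("DATE", "AUTHOR", "TEXT")
# One column per field, one row per <DOC>
reviews <- as.data.frame(
  setNames(lapply(fields, extract_tag, txt = sample_txt), fields),
  stringsAsFactors = FALSE
)
```

For a file on disk, something like paste(readLines("2007_acura_mdx"), collapse = "\n") could supply the text. The regex route is fragile compared with a real parser (it ignores entities such as &amp; and nested tags), so the XML approach above is the safer choice when the package is available.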

Replicate Postgres pg_trgm text similarity scores in R?

Does anyone know how to replicate the Postgres (pg_trgm) trigram similarity score from the similarity(text, text) function in R? I am using the stringdist package and would rather use R to calculate these scores on a matrix of text strings in a .csv file than run a bunch of PostgreSQL queries.
Running similarity(string1, string2) in Postgres gives me a score between 0 and 1.
I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.
stringdist(string1, string2, method="qgram",q = 3 )
Is there a way to replicate the pg_trgm score with the stringdist package or another way to do this in R?
An example would be getting the similarity score between the description of a book and the description of a genre such as science fiction. For example, suppose I have two book descriptions and one genre description:
book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2= "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."
How can I get a similarity score for the description of each book against the description of the science fiction genre like pg_trgm using an R script?
How about something like this?
library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(
  textcat_xdist(
    list(
      text1 = "hello there",
      text2 = "why hello there",
      text3 = "totally different"
    ),
    method = "cosine"
  ),
  3
)
#       text1 text2 text3
# text1 0.000 0.078 0.731
# text2 0.078 0.000 0.739
# text3 0.731 0.739 0.000
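
Note that the matrix above holds distances, not similarities, so it will not match the numbers from Postgres. If the goal is to get close to similarity() itself, the following base R sketch follows the rules described in the pg_trgm documentation as I understand them: lowercase the text, split on non-alphanumeric characters, pad each word with two leading spaces and one trailing space, and divide the number of shared trigrams by the number of distinct trigrams in both strings. The function names trigrams and trgm_similarity are made up here, and I have not checked the results against a live Postgres instance; non-ASCII text in particular may be handled differently.

```r
# Set of trigrams for one string, using pg_trgm-style word padding (assumed rules)
trigrams <- function(x) {
  # Lowercase, then treat every run of non-alphanumeric characters as a separator
  words <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  grams <- unlist(lapply(words, function(w) {
    padded <- paste0("  ", w, " ")  # two spaces in front, one behind
    n <- nchar(padded)
    substring(padded, 1:(n - 2), 3:n)
  }))
  unique(as.character(grams))
}

# Shared trigrams divided by all distinct trigrams (a Jaccard index on trigram sets)
trgm_similarity <- function(a, b) {
  ta <- trigrams(a)
  tb <- trigrams(b)
  total <- length(union(ta, tb))
  if (total == 0) return(0)
  length(intersect(ta, tb)) / total
}

trgm_similarity("hello there", "why hello there")
# [1] 0.75
```

To score several descriptions against one genre text, something like sapply(book_descriptions, trgm_similarity, b = genre_description) should work, where both names stand in for your own character data. With the stringdist package, 1 - stringdist(a, b, method = "jaccard", q = 3) is a related set-based score, but it does not lowercase or pad words, so it will generally differ from the Postgres value.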

Resources