I'm trying to read a huge file with fread, but I think something is wrong with the layout of the file.
If I try to read the file with
data = fread(input = "../data.txt", sep = "\t")
on this file (I took just the line with the error and a few lines before and after):
ID imdbID Title Year Rating Runtime Genre Released Director Writer Cast Metacritic imdbRating imdbVotes Poster Plot FullPlot Language Country Awards lastUpdated Type
683 tt0000683 The Fatal Hour 1908 14 min Short, Crime 1908-08-18 D.W. Griffith D.W. Griffith George Gebhardt, Harry Solter, Linda Arvidson, Florence Auer 5.9 26 Pong Lee, a Mephistophelian, saffron-skinned varlet, has for some time carried on this atrocious female white slave traffic, in which sinister business he was assisted by a stygian whelp, ... Pong Lee, a Mephistophelian, saffron-skinned varlet, has for some time carried on this atrocious female white slave traffic, in which sinister business he was assisted by a stygian whelp, by name Hendricks. Pong writes Hendricks that he has need for five young girls, and so Hendricks sets out to secure them. Visiting a rural district, he has no trouble, by his glib, affable manner, in gaining the confidence of several young and pretty girls. Pong is on hand with a closed carriage to bag the prey. One of the girls, as she is seized, emits a yell that alarms the neighborhood and brings to the scene several policemen and a couple of detectives, who have long been on the lookout for these caitiffs. The Chinese get away with the carriage, however, and Hendricks by subterfuge throws the police on the wrong scent. One of the detectives is a woman, and possessed of shrewd powers of deduction, hence does not swallow the bald story of the villain, and exercises her natural acumen with success. She shadows Hendricks, and by means of a flirtation inveigles him to a restaurant, where she succeeds in doping his drink. He falls asleep and she secures the letter written by Pong, which discloses the hiding place of the Chinaman. This she immediately telephones to the police, and while so doing Hendricks awakes and starts off to warn his friends. 
He arrives at the old deserted house ahead of the police, but escape is impossible, so the police rescue the girls, but fail to secure Pong and Hendricks, who afterwards seize the girl detective, and taking her to the house, tie her to a post and arrange a large pistol on the face of a clock in such a way that when the hands point to twelve the gun is fired and the girl will receive the charge. Twenty minutes are allowed for them to get away, for the hands are now indicating 11:40. Certain death seems to be her fate, and would have been had not an accident disclosed her plight. Hendricks after leaving the place is thrown by a street car, and this serves to discover his identity, so he is captured and a wild ride is made to the house in which the poor girl is incarcerated. This incident is shown in alternate scenes. There is the helpless girl, with the clock ticking its way towards her destruction, and out on the road is the carriage, tearing along at breakneck speed to the rescue, arriving just in time to get her safely out of range of the pistol as it goes off. In conclusion we can promise this to be an exceedingly thrilling film, of more than ordinary interest. English USA 2015-10-24 01:44:09.623000000 movie
684 tt0000684 Father Gets in the Game 1908 10 min Short, Comedy 1908-10-10 D.W. Griffith D.W. Griffith Mack Sennett, Harry Solter, George Gebhardt, Linda Arvidson 5.1 39 "You have got to keep up with the bandwagon or quit." This never impressed old Wilkins so forcibly as when his son and daughter give him the go-by, stamping him as a "has-been," and away ... "You have got to keep up with the bandwagon or quit." This never impressed old Wilkins so forcibly as when his son and daughter give him the go-by, stamping him as a "has-been," and away out of the game. Even Mrs. Wilkins, who is as vivacious as a widow, snubs him. He keenly feels his condition and resolves to alter it. With this in view, he enlists the services of Professor Dyem, the celebrated Dermatologist and Tonsorial Artist. After a session with the Professor, beheld the transmogrified Wilkins. What a change! Shorn of his grizzly beard, his locks raven, complexion florid, eye clear and step elastic, he views himself in the mirror. He hardly recognizes himself. In fact, it requires his valet to convince him that he is he. "Am I in it? Well. I guess. If I don't keep up with and even beat that bandwagon by a city block, my name is not Pill Wilkins." He sallies forth and makes for the park. The first person he encounters is his wife. He approaches her in elation, but she mistakes him for an impudent masher and he receives the weight of her parasol over his head for his trouble. The next one he meets is his daughter. She is seated on a bench, waiting for Charley. He takes a seat beside her and when he tries to make himself known she draws herself up to full height and with a blow sends him backward over the bench onto the grass. Well, he changes his tactics, and gets reckless. Along comes his son with his best girl, so he decides to win her out for spite. 
Now this young lady has a sensitive pneumogastric nerve, and when he sits beside her on the bench and slyly suggests a cold bottle and a hot bird, she is "his'n." This is done so coolly and so quickly, that young Wilkins, who, of course, does not recognize his respected papa, is speechless with rage. He follows them, however, to the café, where his intrusion is resented and he is rudely thrown from the place. At the Wilkins' domicile there is an indignation meeting. Mother, daughter and son all rush in to relate their experiences to father. He is not to be found. Suddenly a hilarious individual enters. "'Tis he, the insulter: a drunk and disorderly." They are about to have him thrown out when the valet comes to his rescue and explains that the jubilant gentleman is no other than their dear papa, who has not only caught up with the bandwagon, but is sitting on the seat with the driver. They all gasp in surprise, and young Wilkins takes a wreath of laurel from a statue and places it on old Wilkins' brow, saying: "Pop, you are the candy." English USA 2015-10-02 04:59:48.643000000 movie
685 tt0000685 The Feud and the Turkey 1908 15 min Short, Drama, Romance 1908-12-08 D.W. Griffith D.W. Griffith Harry Solter, Linda Arvidson, Arthur V. Johnson, Robert Harron 5.8 13 The Wilkinsons and Caulfields, owing to a trivial dispute, had been at loggerheads for years and as time went on the feeling became more bitter, until they even forbade their children ... The Wilkinsons and Caulfields, owing to a trivial dispute, had been at loggerheads for years and as time went on the feeling became more bitter, until they even forbade their children playing together. The little ones, however, in their childish innocence, could not appreciate the odium of their elders, and Bobby Wilkinson and Nellie Caulfield became child lovers. This incensed Colonel Wilkinson, who tore the children apart, ordered Bobby never to be seen in her company again. The Colonel's action ignited the ire of the Caulfields and a furious conflict ensued, resulting in the shooting to death of George, the Colonel's youngest son, a boy of fourteen. From that time on the clans kept strictly to themselves. But love knows no clannishness, and, despite family hatred, Bob and Nellie remained lovers. After ten years, driven to desperation by this apparently insurmountable barrier, they elope and are married. Bob decides to brave the storm of his father's anger and present his wife, but the old Colonel drives him from the house, disowning him. Old Aunt Dinah and Uncle Daniel, the colored servants, were so attached to the young folks that they go with them. Two years later we find the little family, now increased by an infant son, having a hard of it. It is Christmas morning and no turkey for dinner. Old Aunt Dinah, believing in the efficacy of prayer, gets down on her knees in the kitchen to ask the good Lord to send them a bird. Uncle Daniel, touched by this demonstration of faith, takes a gun and determines to get a turkey at any hazard. 
Over the hills he goes, but his journey is hopelessly fruitless until he comes to the rear of the Colonel's house. Tillie, the cook, has just hung a fat turkey on a post outside the kitchen door. When Daniel sees it he can't resist the temptation. Back home he hustles and finds Dinah still at prayer, he lays the fowl on the floor beside her and sneaks out. When Dinah sees it she surely thinks it was due to her prayers. Well, the turkey is cooked and an old-fashioned Christmas anticipated. Meanwhile the Colonel has discovered his loss and tracks the thief to Bob's estate. Entering, a tragedy seems inevitable, but when the old Colonel sees the young one, his grandson, in the cradle, his heart goes out to it and the feud ends then and there. All hands sit down and enjoy a real Merry Christmas dinner. English USA 2015-08-29 00:33:15.610000000 movie
686 tt0000686 Fiestas del carnaval de 1908 en Barcelona 1908 Documentary, Short Fructuós Gelabert Fructuós Gelabert Spain 2015-11-09 14:24:29.583000000 movie
I get this error:
> Error in fread(input = "../data.txt", sep="\t" : Expected sep (' ') but new line, EOF (or other
> non printing character) ends field 20 when detecting types ( first):
> 684 tt0000684 Father Gets in the Game 1908 10 min Short,
> Comedy 1908-10-10 D.W. Griffith D.W. Griffith Mack Sennett, Harry
> Solter, George Gebhardt, Linda Arvidson 5.1 39 "You have got to keep
> up with the bandwagon or quit." This never impressed old Wilkins so
> forcibly as when his son and daughter give him the go-by, stamping him
> as a "has-been," and away ... "You have got to keep up with the
> bandwagon or quit." This never impressed old Wilkins so forcibly as
> when his son and daughter give him the go-by, stamping him as a
> "has-been," and away out of the game. Even Mrs. Wilkins, who is as
> vivacious as a widow, snubs him. He keenly feels his condition and
> resolves to alter it. With this in view, he enlists the services of
> Professor Dyem, the celebrated Dermatologist and Tonsorial Artist.
> After a session with the Professor, beheld the transmogrified Wilkins.
> W
How can I solve it?
I'm not 100% sure what the error in your data is, but try adding fill = TRUE to the fread options:
data = fread(input = "../data.txt", sep = "\t", fill = TRUE)
I had a similar error, and it seemed that fread was having trouble identifying my column separation. Setting fill to TRUE lets fread fill in any missing fields, so at least you can then check the resulting data table and find out where the weirdness is.
Add fill = TRUE to the fread call.
What's happening: the rows in the data have unequal numbers of fields. With fill = TRUE, the missing fields are implicitly filled with blanks.
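Before relying on fill = TRUE, it can help to find out which physical lines are short. In the sample posted in the question, the plot text for records 683 to 685 appears to continue onto a new line, which would split each of those records in two. The sketch below uses base R's count.fields on a small made-up file (the file contents and the variable names are assumptions for illustration, not the asker's data) to list the lines whose field count differs from the header's.

```r
# Toy tab-separated file: record 2 is split in two by a stray newline,
# similar to what seems to happen in the posted sample.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "ID\tTitle\tPlot\tType",
  "1\tFirst\tA short plot\tmovie",
  "2\tSecond\tA plot that is broken",
  "in the middle\tmovie",
  "3\tThird\tAnother plot\tmovie"
), tmp)

# Count tab-separated fields on every physical line.
# quote = "" keeps quotation marks inside the text from being treated
# as field quoting; comment.char = "" keeps '#' from starting a comment.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Lines whose field count differs from the header line are suspects.
expected  <- n_fields[1]
bad_lines <- which(n_fields != expected)
print(bad_lines)
```

Once the offending lines are known, they can be inspected with readLines and repaired at the source, which is usually safer than letting blank fields be filled in silently.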
Can anyone please help me import angle-bracket data into R from a Unix executable file? It looks like XML, so I tried to use an XML parser, but it failed.
I have attached a sample file.
Thanks in advance.
https://drive.google.com/file/d/0B97ow4h4jwHcRTVtWHdudDJ0c1k/view?usp=sharing
Unescaped '&' characters appear inside elements in your XML document.
One example is below:
<DOC>
<DATE>01/07/2009</DATE>
<AUTHOR>Debce</AUTHOR>
<TEXT>I have owned my MDX for about 1 1/2 yrs & have loved every minute of driving the 24k problem free miles on it! It is so much fun to drive; looks & feels luxurious so no problem pulling up to upscale places! I didn't want to give up space to pop things in the back and go so I keep the third seat down & purchased the rubber mat for the back. I have plenty of room while at the same time I am "zippy"; easily pulling into parking spaces and getting around town. I love the navigation system, although it does need updating and the bluetooth is wonderful, although for some reason it keeps unhooking my Treo phone which the Acura people say is the phone's fault. LOVE IT & would buy it again.</TEXT>
<FAVORITE>Large storage area, hands free phone with the bluetooth & voice recognition is safe. The heaviness of it feels safe and large interior is very comfortable. </FAVORITE>
</DOC>
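The '&' characters should be escaped as '&amp;'.
In XML text, '<' and '&' must always be escaped (as '&lt;' and '&amp;'), and '>' is usually escaped as '&gt;' as well. '%' is only special inside a DTD, so it does not need escaping here.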
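If the bare '&' characters are the only problem, one option is to escape them before handing the text to a strict XML parser. The following is a minimal base R sketch under that assumption; the function name escape_bare_ampersands and the sample string are made up for illustration. It leaves ampersands that already start an entity such as &amp; or &#38; untouched.

```r
# Replace every '&' that does not already begin a named or numeric
# entity with '&amp;'. The negative lookahead needs perl = TRUE.
escape_bare_ampersands <- function(x) {
  gsub("&(?!(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)",
       "&amp;", x, perl = TRUE)
}

raw   <- "<TEXT>looks & feels luxurious; bluetooth &amp; voice</TEXT>"
fixed <- escape_bare_ampersands(raw)
print(fixed)
```

The cleaned text, for example the result of applying this to readLines output, could then be passed to an XML parser instead of the original file.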
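Here is a way of extracting the data into a character matrix. It uses htmlParse, which is more forgiving about the unescaped '&' characters; note that it lowercases the tag names, hence "//doc".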
> require(XML)
> x <- htmlParse("/temp/2007_acura_mdx")
>
> # get the 'DOC'
> docs <- getNodeSet(x, "//doc")
>
> # display one
> docs[[1]]
<doc>
<date>07/31/2009</date>
<author>FlewByU</author>
<text>I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. </text>
<favorite>The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. </favorite>
</doc>
>
> # process docs getting all fields -- need to transpose
> results <- t(sapply(docs, function(x) xmlSApply(x, xmlValue)))
>
> # show head
> head(results)
date author
[1,] "07/31/2009" "FlewByU"
[2,] "07/30/2009" "cvillemdx"
[3,] "06/22/2009" "Pleased"
[4,] "04/13/2009" "wasatch7"
[5,] "04/06/2009" "mnozek"
[6,] "01/07/2009" "Debce"
text
[1,] "I just moved to Germany two months ago and bought an 07 MDX from another military member. It has everything I could want. We just returned from a week driving through the Alps and this SUV is simply amazing. Granted, I get to drive it much faster than I could in the states, but even at 120 MPH, it was rock solid. We need the AWD for the snow and the kids stay entertained with the AV system. Plenty of passing power and very comfortable on long trips. Acuras are rare in Germany and I get stares all the time by curious Bavarians wondering what kind of vehicle I have. If you are in the market for a luxury SUV for family touring, with cool tech toys to play with, MDX can't be beat. "
[2,] "After months of careful research and test drives at BMW, Lexus, Volvo, etc. I settled on the MDX without a doubt in mind. I love the way the car handles, no stiffness or resistance in the steering or acceleration. The interior design is a little Star Trek for me, but once I figured everything out, it is a pleasure to have all the extras (XM radio, navigation, Bluetooth, backup camera, etc.)"
[3,] "I'm two years into a three year lease and I love this car. The only thing I would change would be the shape of the grill...THAT'S IT. Everything else is perfect. Great performance, plenty of power and AWD when skiing, plenty of room for baggage, great MPG for an SUV, navi system is far superior to GM's Suburban (don't have to put in park to change your destination, etc). Zero problems...just gas and oil changes. One beautiful car...except for the sho-gun shield looking grill."
[4,] "First luxury crossover SUV I have owned. MDX won out over the Lexus, and cost less for a very well equipped base package. Handling, power and ride are outstanding. Back seats are a little less comfortable for my tall teenagers. Back cargo area is very roomy, and easily expandable with 3rd seat folded and back seats down. I drive up snowy, often treacherous mountain canyons to ski in the winter. The SH-AWD system, coupled with the manual shift mode (for descents), is outstanding. The MDX is much better in the snow than 3 truck base SUVs, I have owned previously. "
[5,] "This is the first Japanese SUV we have had in a while. Last SUV's were Yukon XL and Envoy XL. This beats them out by far. Performs almost as well as our Mercedes e class but has the utility of our Envoy. We always take this on trips and it is very comfortable. The third row is great for smaller children but not so much for adults. Best SUV so far. No problems within our almost 2 years ownership."
[6,] "I have owned my MDX for about 1 1/2 yrs & have loved every minute of driving the 24k problem free miles on it! It is so much fun to drive; looks & feels luxurious so no problem pulling up to upscale places! I didn't want to give up space to pop things in the back and go so I keep the third seat down & purchased the rubber mat for the back. I have plenty of room while at the same time I am \"zippy\"; easily pulling into parking spaces and getting around town. I love the navigation system, although it does need updating and the bluetooth is wonderful, although for some reason it keeps unhooking my Treo phone which the Acura people say is the phone's fault. LOVE IT & would buy it again."
favorite
[1,] "The separate controls for the rear passengers are awesome. I can control temp and AV from the front or switch to rear. Sound system is amazing. I will sometimes sit in the driveway and just listen. Also has a 120v outlet in console. Great for us since we live with 220v and need 120 on occasion. "
[2,] "The self-adjusting side mirrors which rotate to give you a view of the curb/lines as you back up. Makes backing into parking spaces and parallel parking a breeze, along with the back-up camera. Also a fan of the push-to-talk for my cell phone."
[3,] "Navi is easy, hands-free is great, AWD is perfect."
[4,] "AWD system, exterior styling, cargo room"
[5,] "Navigation, sound system, bluetooth, comfort, acceleration, performance, all wheel drive ability."
[6,] "Large storage area, hands free phone with the bluetooth & voice recognition is safe. The heaviness of it feels safe and large interior is very comfortable. "
Does anyone know how to replicate the PostgreSQL (pg_trgm) trigram similarity score from the similarity(text, text) function in R? I am using the stringdist package and would rather use R to calculate these on a matrix of text strings in a .csv file than run a bunch of PostgreSQL queries.
Running similarity(string1, string2) in PostgreSQL gives me a score between 0 and 1.
I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.
stringdist(string1, string2, method="qgram",q = 3 )
Is there a way to replicate the pg_trgm score with the stringdist package or another way to do this in R?
An example would be getting the similarity score between the description of a book and the description of a genre like science fiction. For example, say I have two book descriptions and one genre description:
book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2= "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."
How can I get a similarity score for the description of each book against the description of the science fiction genre like pg_trgm using an R script?
How about something like this?
library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(textcat_xdist(
list(
text1="hello there",
text2="why hello there",
text3="totally different"
),
method="cosine"),
3)
# text1 text2 text3
#text1 0.000 0.078 0.731
#text2 0.078 0.000 0.739
#text3 0.731 0.739 0.000
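If the goal is a number closer to what pg_trgm reports, the score can also be approximated directly in base R. As far as I understand pg_trgm, it lowercases the text, splits it into words on non-alphanumeric characters, pads each word with two spaces in front and one behind, collects the set of distinct trigrams, and returns the number of shared trigrams divided by the number of distinct trigrams in either string. The sketch below follows that description; the function names trigrams and trgm_similarity are my own, and results may differ from PostgreSQL for non-ASCII text or unusual locales.

```r
# Distinct padded trigrams of a string, in the style described above.
trigrams <- function(s) {
  words <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  unique(unlist(lapply(words, function(w) {
    padded <- paste0("  ", w, " ")          # two leading spaces, one trailing
    starts <- seq_len(nchar(padded) - 2)
    substring(padded, starts, starts + 2)
  })))
}

# Shared trigrams divided by all distinct trigrams (a Jaccard index).
trgm_similarity <- function(a, b) {
  ta <- trigrams(a)
  tb <- trigrams(b)
  if (length(ta) == 0 || length(tb) == 0) return(0)
  length(intersect(ta, tb)) / length(union(ta, tb))
}

print(trgm_similarity("word", "two words"))
```

To score every book against the genre, something like sapply(book_descriptions, trgm_similarity, b = genre_description) would work, where both variable names are placeholders for your own data.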
I'm trying to use the Naive Bayes Learner from e1071 to do spam analysis. This is the code I use to set up the model.
library(e1071)
emails=read.csv("emails.csv")
emailstrain=read.csv("emailstrain.csv")
model<-naiveBayes(type ~.,data=emailstrain)
There are two sets of emails that both have a 'statement' and a 'type'. One is for training and one is for testing. When I run
model
and just read the raw output, it seems to give a higher-than-zero chance of a statement being spam when it is indeed spam, and the same is true when the statement is not. However, when I try to use the model to predict the testing data with
table(predict(model,emails),emails$type)
I get
       ham spam
  ham 2086  321
  spam   2    0
which seems wrong. I also tried predicting on the training set itself; in that case it should give quite good results, or at least as good as what was observed in the model. However, it gave
       ham spam
  ham 2735  420
  spam   0    6
which is only slightly better than with the testing set. I think something must be wrong with how the predict function is working.
Here is how the data files are set up, with some examples of what's inside:
type,statement
ham,How much did ur hdd casing cost.
ham,Mystery solved! Just opened my email and he's sent me another batch! Isn't he a sweetie
ham,I can't describe how lucky you are that I'm actually awake by noon
spam,This is the 2nd time we have tried to contact u. U have won the £1450 prize to claim just call 09053750005 b4 310303. T&Cs/stop SMS 08718725756. 140ppm
ham,"TODAY is Sorry day.! If ever i was angry with you, if ever i misbehaved or hurt you? plz plz JUST SLAP URSELF Bcoz, Its ur fault, I'm basically GOOD"
ham,Cheers for the card ... Is it that time of year already?
spam,"HOT LIVE FANTASIES call now 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870..k"
ham,"When people see my msgs, They think Iam addicted to msging... They are wrong, Bcoz They don\'t know that Iam addicted to my sweet Friends..!! BSLVYL"
ham,Ugh hopefully the asus ppl dont randomly do a reformat.
ham,"Haven't seen my facebook, huh? Lol!"
ham,"Mah b, I'll pick it up tomorrow"
ham,Still otside le..u come 2morrow maga..
ham,Do u still have plumbers tape and a wrench we could borrow?
spam,"Dear Voucher Holder, To claim this weeks offer, at you PC please go to http://www.e-tlp.co.uk/reward. Ts&Cs apply."
ham,It vl bcum more difficult..
spam,UR GOING 2 BAHAMAS! CallFREEFONE 08081560665 and speak to a live operator to claim either Bahamas cruise of£2000 CASH 18+only. To opt out txt X to 07786200117
I would really love suggestions. Thank you so much for your help
Actually, the predict function works just fine. Don't get me wrong, but the problem is in what you are doing. You are building the model using this formula: type ~ ., right? It is clear what we have on the left-hand side of the formula, so let's look at the right-hand side.
In your data you have only two variables, type and statement, and because type is the dependent variable, the only thing that counts as an independent variable is statement. So far everything is clear.
Let's take a look at the Bayesian classifier. The a priori probabilities are obvious, right? What about the conditional probabilities? From the classifier's point of view you have only one categorical variable (your sentences). To the classifier it is just a list of labels. All of them are unique, so the a posteriori probabilities will be close to the a priori ones.
In other words, the only thing we can tell when we get a new observation is that the probability of it being spam is equal to the proportion of spam messages in your training set.
If you want to use any machine learning method on natural language, you have to pre-process your data first. Depending on your problem, that could mean, for example, stemming, lemmatization, computing n-gram statistics, or tf-idf. Training the classifier is the last step.
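To make the pre-processing point concrete, here is a minimal base R sketch that turns messages into word-presence features, so that the classifier sees shared words instead of one unique label per message. The four messages are shortened from the examples in the question; the tokenisation rule and the variable names are assumptions for illustration, and real work would more likely use a text-mining package.

```r
# Four toy messages shortened from the question's sample data.
msgs <- data.frame(
  type = c("ham", "ham", "spam", "spam"),
  statement = c(
    "How much did ur hdd casing cost.",
    "Cheers for the card ... Is it that time of year already?",
    "HOT LIVE FANTASIES call now 08707509020 Just 20p per min",
    "UR GOING 2 BAHAMAS! CallFREEFONE 08081560665 and speak to a live operator"
  ),
  stringsAsFactors = FALSE
)

# Lowercase, split on anything that is not a letter or digit,
# and keep each distinct word once per message.
tokens <- lapply(strsplit(tolower(msgs$statement), "[^a-z0-9]+"),
                 function(w) unique(w[nzchar(w)]))

# Document-term matrix: one row per message, one logical column per word.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(w) vocab %in% w))
colnames(dtm) <- vocab

# Share of messages in each class that contain the word "live".
live_by_type <- tapply(dtm[, "live"], msgs$type, mean)
print(live_by_type)
```

Columns of a matrix like dtm, rather than the raw statement column, are what a naive Bayes model can estimate useful conditional probabilities from.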
I'm doing some basic text analysis in R and want to count the number of lines for a transcript from a .txt file that I load into R. With the example below, I want to get a count in which each speaker has a line count attached, such that Mr. Smith = 4, Mr. Gordon = 6, Mr. Catalano = 3.
[71] "\"511\"\t\"MR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\""
[72] "\"513\"\t\"MR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\""
[73] "\"515\"\t\"MR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""
The function countLine() doesn't work since it requires a connection, and these are just .txt files imported into R. I realize that the line count depends on the formatting of whatever the text is opened in, but any general help on whether this is feasible would be appreciated. Thanks.
I didn't think your example was reproducible, so I edited it to contain what you posted, but I do not know if the names will match:
txtvec <- structure(list(`'511' ` = "MR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\"",
`'513' ` = "MR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\"",
`'515' ` = "MR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""), .Names = c("'511'\t",
"'513'\t", "'515'\t"))
So it's only a matter of running a regular expression across it and tabulating the results:
table( sapply(txtvec, function(x) sub("(^MR.+)\\:.+", "\\1", x) ) )
#MR Catalano MR Gordon MR Smith
# 1 1 1
There was concern expressed that the names were not in the original structure. This is another version with an unnamed vector and a slightly modified regex:
txtvec <- c("\"511\"\t\"\nMR Smith: Mr Speaker, I like the spirit in which we are agreeing on this. The administration of FUFA is present here. FUFA could be used as a conduit, but the intention of what hon. Beti Kamya brought up and what hon. Rose Namayanja has said was okufuwa - just giving a token of appreciation to the players who achieved this.\"",
"\"513\"\t\"\nMR Gordon: Thank you very much, Mr Speaker. FUFA is an organisation and the players are the ones who got the cup for us. To promote motivation in all activities, not only football, you should remunerate people who have done well. In this case, we have heard about FUFA with their problems. They have not paid water bills and they can take this money to pay the water bills. If we agree that this money is supposed to go to the players and the coaches, then when it goes there they would know the amount and they will sit among themselves and distribute according to what we will have given. (Applause) I thank you.\"",
"\"515\"\t\"\nMR Catalano: Mr Speaker, I want to give information to my dear colleagues. The spirit is very good but you must be mindful that the administration of FUFA is what has made this happen. The money to the players. That indicates to you that FUFA is very trustworthy. This is not the old FUFA we are talking about.\""
)
table( sapply(txtvec, function(x) sub(".+\\n(MR.+)\\:.+", "\\1", x) ) )
#MR Catalano MR Gordon MR Smith
# 1 1 1
To count the number of "lines" these would occupy on a wrapping device with 80 characters per line you could use this code (which could easily be converted to a function):
sapply(txtvec, function(tt) 1+nchar(tt) %/% 80)
#[1] 5 8 4
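The two pieces above, the speaker name and the wrapped line count, can be combined into one per-speaker total. This is a sketch on made-up strings (the repeated letters stand in for real speech, and the 80-character width is the same assumption as above); it uses ceiling so that a text of exactly 80 characters counts as one line.

```r
# Toy transcript entries in the same shape as the question's data.
txt <- c(
  paste0("\"511\"\t\"MR Smith: ",  strrep("a", 150)),
  paste0("\"513\"\t\"MR Gordon: ", strrep("b", 70)),
  paste0("\"515\"\t\"MR Smith: ",  strrep("c", 10))
)

# Speaker = the "MR <name>" that precedes the first colon after it.
# (?s) lets '.' match newlines, in case the text contains "\n".
speaker <- sub("(?s)^.*?(MR [^:]+):.*$", "\\1", txt, perl = TRUE)

# Lines each entry would occupy at 80 characters per line.
n_lines <- ceiling(nchar(txt) / 80)

# Total lines per speaker across all of that speaker's entries.
lines_by_speaker <- tapply(n_lines, speaker, sum)
print(lines_by_speaker)
```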
This was raised in the comments, but it really bears being its own answer:
You cannot "count lines" without defining what a "line" is. A line is a very vague concept and can vary by the program being used.
Unless of course the data contains some indicator of a line break, such as \n. But even then, you would not be counting lines, you would be counting linebreaks. You would then have to ask yourself if the hardcoded line break is in accord with what you are hoping to analyze.
--
If your data does not contain line breaks but you still want to count the number of lines, then we're back to the question of how you define a line. The most basic way, as @flodel suggests, is to use character length. For example, you can define a line as 76 characters long, and then take
ceiling(nchar(X) / 76)
This of course assumes that you can cut words. (If you need words to remain whole, then you have to get craftier.)
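One way to keep words whole is to let base R's strwrap do the wrapping and then count the pieces it returns. This is a minimal sketch; the function name count_wrapped_lines is my own, and the width is whatever you decide a "line" is.

```r
# Number of lines each string occupies when wrapped at word boundaries.
count_wrapped_lines <- function(x, width = 76) {
  vapply(x, function(s) length(strwrap(s, width = width)),
         integer(1), USE.NAMES = FALSE)
}

# Forty four-letter words: four of them (19 characters with spaces)
# fit on a line of width 21, five (24 characters) do not.
sample_text <- paste(rep("word", 40), collapse = " ")
n_wrapped <- count_wrapped_lines(sample_text, width = 21)
print(n_wrapped)
```

Because lines break only at spaces, this count can be higher than the ceiling(nchar(X) / 76) estimate for the same text.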