I have some columns titled essay0 to essay9. I want to iterate over them, count the words in each, and then make a new column with the number of words, so essay0 will get a column essay0_num containing 5 if that is how many words it has.
So far I have cupid <- cupid %>% mutate(essay9_num = sapply(strsplit(essay9, " "), length))
to count the words and add a column, but I don't want to do it one by one for all 10.
I tried a for loop:
for (i in 0:31) {
cupid <- cupid %>% mutate(xxx_num = sapply(strsplit(xxx, " "), length))
}
but I am not sure how to iterate over the columns in a for loop in R. I thought maybe I could pull out the columns I need, put them into a new df, and use sapply somehow that way, but I still run into the problem of iterating over the columns.
dput:
dput(head(cupid))
structure(list(age = c(22L, 35L, 38L, 23L, 29L, 29L), status = c("single",
"single", "available", "single", "single", "single"), sex = c("m",
"m", "m", "m", "m", "m"), orientation = c("straight", "straight",
"straight", "straight", "straight", "straight"), body_type = c("a little extra",
"average", "thin", "thin", "athletic", "average"), diet = c("strictly anything",
"mostly other", "anything", "vegetarian", "", "mostly anything"
), drinks = c("socially", "often", "socially", "socially", "socially",
"socially"), drugs = c("never", "sometimes", "", "", "never",
""), education = c("working on college/university", "working on space camp",
"graduated from masters program", "working on college/university",
"graduated from college/university", "graduated from college/university"
), ethnicity = c("asian, white", "white", "", "white", "asian, black, other",
"white"), height = c(75, 70, 68, 71, 66, 67), income = c(-1L,
80000L, -1L, 20000L, -1L, -1L), job = c("transportation", "hospitality / travel",
"", "student", "artistic / musical / writer", "computer / hardware / software"
), last_online = c("2012-06-28-20-30", "2012-06-29-21-41", "2012-06-27-09-10",
"2012-06-28-14-22", "2012-06-27-21-26", "2012-06-29-19-18"),
location = c("south san francisco, california", "oakland, california",
"san francisco, california", "berkeley, california", "san francisco, california",
"san francisco, california"), offspring = c("doesn't have kids, but might want them",
"doesn't have kids, but might want them", "", "doesn't want kids",
"", "doesn't have kids, but might want them"), pets = c("likes dogs and likes cats",
"likes dogs and likes cats", "has cats", "likes cats", "likes dogs and likes cats",
"likes cats"), religion = c("agnosticism and very serious about it",
"agnosticism but not too serious about it", "", "", "", "atheism"
), sign = c("gemini", "cancer", "pisces but it doesn’t matter",
"pisces", "aquarius", "taurus"), smokes = c("sometimes",
"no", "no", "no", "no", "no"), speaks = c("english", "english (fluently), spanish (poorly), french (poorly)",
"english, french, c++", "english, german (poorly)", "english",
"english (fluently), chinese (okay)"), essay0 = c("about me: i would love to think that i was some some kind of intellectual: either the dumbest smart guy, or the smartest dumb guy. can't say i can tell the difference. i love to talk about ideas and concepts. i forge odd metaphors instead of reciting cliches. like the simularities between a friend of mine's house and an underwater salt mine. my favorite word is salt by the way (weird choice i know). to me most things in life are better as metaphors. i seek to
make myself a little better everyday, in some productively lazy way. got tired of tying my shoes. considered hiring a five year old, but would probably have to tie both of our shoes... decided to only wear leather shoes dress shoes. about you: you love to have really serious, really deep conversations about really silly stuff. you have to be willing to snap me out of a light hearted rant with a kiss. you don't have to be funny, but you have to be able to make me laugh. you should be able to bend spoons with your
mind, and telepathically make me smile while i am still at work. you should love life, and be cool with just letting the wind blow. extra points for reading all this and guessing my favorite video game (no hints given yet). and lastly you have a good attention span.",
"i am a chef: this is what that means. 1. i am a workaholic. 2. i love to cook regardless of whether i am at work. 3. i love to drink and eat foods that are probably really bad for me. 4. i love being around people that resemble line 1-3. i love the outdoors and i am an avid skier. if its snowing i will be in tahoe at the very least. i am a very confident and friendly. i'm not interested in acting or being a typical guy. i have no time or patience for rediculous acts of territorial pissing. overall i am a very
likable easygoing individual. i am very adventurous and always looking forward to doing new things and hopefully sharing it with the right person.",
"i'm not ashamed of much, but writing public text on an online dating site makes me pleasantly uncomfortable. i'll try to be as earnest as possible in the noble endeavor of standing naked before the world. i've lived in san francisco for 15 years, and both love it and find myself frustrated with its deficits. lots of great friends and acquaintances (which increases my apprehension to put anything on this site), but i'm feeling like meeting some new people that aren't just friends of friends. it's okay if you are a friend of a friend too. chances are, if you make it through the complex filtering process of multiple choice questions, lifestyle statistics, photo scanning, and these indulgent blurbs of text without moving quickly on to another search result, you are probably already a cultural peer and at most 2 people removed. at first, i thought i should say as little as possible here to avoid
you, but that seems silly. as far as culture goes, i'm definitely more on the weird side of the spectrum, but i don't exactly wear it on my sleeve. once you get me talking, it will probably become increasingly apparent that while i'd like to think of myself as just like everybody else (and by some definition i certainly am), most people don't see me that way. that's fine with me. most of the people i find myself gravitating towards are pretty weird themselves. you probably are too.",
"i work in a library and go to school. . .", "hey how's it going? currently vague on the profile i know, more to come soon. looking to meet new folks outside of my circle of friends. i'm pretty responsive on the reply tip, feel free to drop a line. cheers.",
"i'm an australian living in san francisco, but don't hold that against me. i spend most of my days trying to build cool stuff for my company. i speak mandarin and have been known to bust out chinese songs at karaoke. i'm pretty cheeky. someone asked me if that meant something about my arse, which i find really funny. i'm a little oddball. i have a wild imagination; i like to think
of the most improbable reasons people are doing things just for fun. i love to laugh and look for reasons to do so. occasionally this gets me in trouble because people think i'm laughing at them. sometimes i am, but more often i'm only laughing at myself. i'm an entrepreneur (like everyone else in sf, it seems) and i love what i do. i enjoy parties and downtime in equal measure. intelligence really turns me on and i love people who can teach me new things."
), essay1 = c("currently working as an international agent for a freight forwarding company. import, export, domestic you know the works. online classes and trying to better myself in my free time. perhaps a hours worth of a good book or a video game on a
lazy sunday.",
"dedicating everyday to being an unbelievable badass.", "i make nerdy software for musicians, artists, and experimenters to indulge in their own weirdness, but i like to spend time away from the computer when working on my artwork (which is typically more
concerned with group dynamics and communication, than with visual form, objects, or technology). i also record and deejay dance, noise, pop, and experimental music (most of which electronic or at least studio based). besides these relatively ego driven activities, i've been enjoying things like meditation and tai chi to try and gently flirt with ego death.",
"reading things written by old dead people", "work work work work + play",
"building awesome stuff. figuring out what's important. having adventures. looking for treasure."
), essay2 = c("making people laugh. ranting about a good salting. finding simplicity in complexity, and complexity in simplicity.",
"being silly. having ridiculous amonts of fun wherever. being a smart ass. ohh and i can cook. ;)",
"improvising in different contexts. alternating between being present and decidedly outside of a moment, or trying to hold both at once. rambling intellectual conversations that hold said conversations in contempt while seeking to find something that transcends them. being critical while remaining generous. listening to and using body language--often performed in caricature or large
gestures, if not outright interpretive dance. dry, dark, and raunchy humor.",
"playing synthesizers and organizing books according to the library of congress classification system",
"creating imagery to look at: http://bagsbrown.blogspot.com/ http://stayruly.blogspot.com/",
"imagining random shit. laughing at aforementioned random shit. being goofy. articulating what i think and feel. convincing people i'm right. admitting when i'm wrong. i'm also pretty good at helping people think through problems; my friends say i give good advice. and when i don't have a clue how to help, i will say: i give pretty good hug."
), essay3 = c("the way i look. i am a six foot half asian, half caucasian mutt. it makes it tough not to notice me, and for me to blend in.",
"", "my large jaw and large glasses are the physical things people comment on the most. when sufficiently stimulated, i have an unmistakable cackle of a laugh. after that, it goes in more directions than i care to describe right now. maybe i'll come back to this.",
"socially awkward but i do my best", "i smile a lot and my inquisitive nature",
"i have a big smile. i also get asked if i'm wearing blue-coloured contacts (no)."
), essay4 = c("books: absurdistan, the republic, of mice and men (only book that made me want to cry), catcher in the rye, the prince. movies: gladiator, operation valkyrie, the producers, down periscope. shows: the borgia, arrested development, game of
thrones, monty python music: aesop rock, hail mary mallon, george thorogood and the delaware destroyers, felt food: i'm down for anything.",
"i am die hard christopher moore fan. i don't really watch a lot of tv unless there is humor involved. i am kind of stuck on 90's alternative music. i am pretty much a fan of everything though... i do need to draw a line at most types of electronica.",
"okay this is where the cultural matrix gets so specific, it's like being in the crosshairs. for what it's worth, i find myself reading more non-fiction than fiction. it's usually some kind of philosophy, art, or science text by silly authors such as ranciere, de certeau, bataille, baudrillard, butler, stein, arendt, nietzche, zizek, etc. i'll often throw in some weird new age or pop-psychology book in the mix as well. as for fiction, i enjoy what little i've read of eco, perec, wallace, bolao, dick, vonnegut, atwood, delilo, etc. when i was young, i was a rabid asimov reader. directors i find myself drawn to are makavejev, kuchar, jodorowsky, herzog, hara, klein, waters, verhoeven, ackerman, hitchcock, lang, gorin, goddard, miike, ohbayashi, tarkovsky, sokurov, warhol, etc. but i also like a good amount of \"trashy\" stuff. too much to name. i definitely enjoy the character development that happens in long form episodic television over the course of 10-100 episodes, which a 1-2hr movie usually can't compete with. some of my recent tv favorites are: breaking bad, the wire, dexter, true blood, the prisoner, lost, fringe. a smattered sampling of
the vast field of music i like and deejay: art ensemble, sun ra, evan parker, lil wayne, dj funk, mr. fingers, maurizio, rob hood, dan bell, james blake, nonesuch recordings, omar souleyman, ethiopiques, fela kuti, john cage, meredith monk, robert ashley, terry riley, yoko ono, merzbow, tom tom club, jit, juke, bounce, hyphy, snap, crunk, b'more, kuduro, pop, noise, jazz, techno, house,
acid, new/no wave, (post)punk, etc. a few of the famous art/dance/theater folk that might locate my sensibility: andy warhol, bruce nauman, yayoi kusama, louise bourgeois, tino sehgal, george kuchar, michel duchamp, marina abramovic, gelatin, carolee schneeman, gustav metzger, mike kelly, mike smith, andrea fraser, gordon matta-clark, jerzy grotowski, samuel beckett, antonin artaud, tadeusz kantor, anna halperin, merce cunningham, etc. i'm clearly leaving out a younger generation of contemporary artists, many of whom are friends. local food regulars: sushi zone, chow, ppq, pagolac, lers ros, burma superstar, minako, shalimar, delfina pizza, rosamunde, arinells, suppenkuche, cha-ya, blue plate, golden era, etc.",
"bataille, celine, beckett. . . lynch, jarmusch, r.w. fassbender. . . twin peaks & fishing w/ john joy division, throbbing gristle, cabaret voltaire. . . vegetarian pho and coffee",
"music: bands, rappers, musicians at the moment: thee oh sees. forever: wu-tang books: artbooks for days audiobooks: my collection, thick (thanks audible) shows: live ones food: with stellar friends whenever movies > tv podcast: radiolab, this american life, the moth, joe rogan, the champs",
"books: to kill a mockingbird, lord of the rings, 1984, the farseer trilogy. music: the beatles, frank sinatra, john mayer, jason mraz, deadmau5, andrew bayer, everything on anjunadeep records, bach, satie. tv shows: how i met your mother, scrubs, the west wing, breaking bad. movies: star wars, the godfather pt ii, 500 days of summer, napoleon dynamite, american beauty, lotr food: thai, vietnamese, shanghai dumplings, pizza!"
), essay5 = c("food. water. cell phone. shelter.", "delicious porkness in all of its glories. my big ass doughboy's sinking into 15 new inches. my overly resilient liver. a good sharp knife. my ps3... it plays blurays too. ;) my over the top energy and my
outlook on life... just give me a bag of lemons and see what happens. ;)",
"movement conversation creation contemplation touch humor",
"", "", "like everyone else, i love my friends and family, and need hugs, human contact, water and sunshine. let's take that as given. 1. something to build 2. something to sing 3. something to play on (my guitar would be first choice) 4. something to write/draw on 5. a big goal worth dreaming about 6. something to laugh at"
), essay6 = c("duality and humorous things", "", "", "cats and german philosophy",
"", "what my contribution to the world is going to be and/or should be. and what's for breakfast. i love breakfast."
), essay7 = c("trying to find someone to hang out with. i am down for anything except a club.",
"", "viewing. listening. dancing. talking. drinking. performing.",
"", "", "out with my friends!"), essay8 = c("i am new to california and looking for someone to wisper my secrets to.",
"i am very open and will share just about anything.", "when i was five years old, i was known as \"the boogerman\".",
"", "", "i cried on my first day at school because a bird shat on my head. true story."
), essay9 = c("you want to be swept off your feet! you are tired of the norm. you want to catch a coffee or a bite. or if you
want to talk philosophy.",
"", "you are bright, open, intense, silly, ironic, critical, caring, generous, looking for an exploration, rather than finding \"a match\" of some predetermined qualities. i'm currently in a fabulous and open relationship, so you should be comfortable with that.",
"you feel so inclined.", "", "you're awesome.")), row.names = c(NA,
6L), class = "data.frame")
Use across() to apply the same function to multiple columns:
library(dplyr)

cupid %>%
  mutate(across(starts_with("essay"),
                \(x) stringr::str_count(x, " +") + 1,
                .names = "{.col}_num"))
# ...other column...
# essay0_num essay1_num essay2_num essay3_num essay4_num essay5_num essay6_num essay7_num
# 1 237 45 16 28 62 5 4 16
# 2 130 7 18 1 50 53 1 1
# 3 246 90 65 46 355 6 1 6
# 4 11 7 13 7 29 1 4 1
# 5 40 6 7 8 44 1 1 1
# 6 160 12 60 15 70 59 20 4
# essay8_num essay9_num
# 1 14 30
# 2 10 1
# 3 12 39
# 4 1 4
# 5 1 1
# 6 17 2
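I simplified your word counting logic: splitting on spaces and taking the length is the same as counting the spaces and adding 1, except for empty strings, where strsplit() gives 0 and this approach gives 1 (which is why the empty essays show 1 in the output above). Using " +" as a regex pattern means consecutive spaces are treated as a single separator.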
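If you would rather keep the for loop from the question, here is a minimal base R sketch of how the column names can be built as strings and used with [[ ]]. The toy data frame is made up for illustration (it is not the cupid data), and the loop runs over 0:1 only because the toy has two essay columns; for the real data it would presumably be 0:9. Because it uses strsplit(), empty essays get 0 here rather than 1.

```r
# Toy stand-in for cupid (made-up data with only two essay columns)
toy <- data.frame(
  essay0 = c("i love salt", "", "one  two"),
  essay1 = c("work work work work + play", "reading", ""),
  stringsAsFactors = FALSE
)

for (i in 0:1) {                        # would be 0:9 for essay0..essay9
  col <- paste0("essay", i)             # column name as a string, e.g. "essay0"
  # [[ ]] accepts a string, so columns can be read and created by name;
  # " +" treats runs of spaces as one separator, lengths() counts the pieces
  toy[[paste0(col, "_num")]] <- lengths(strsplit(toy[[col]], " +"))
}

toy
```

The key point is that mutate() expects bare column names, while [[ ]] takes a string, which is what makes a plain loop over generated names straightforward.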
I have a collection of texts which are organised in a data frame in the following way:
I would need such texts to be organised in the following way
I have been through a lot of previous questions here, but all the merging approaches suggested involve calculations, which is not the case here. I have also consulted the tidytext package but could not find a function to merge text in this way.
Any help is appreciated.
Edit
A piece of the actual data frame would be:
dput(df1)
structure(list(Title = c("Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Immigrants five times better off in Britain - Daily Star", "Immigrants five times better off in Britain - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star",
"Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star"
), Content = c("IMMIGRANTS from Romania and Bulgaria would be five times better off if they moved to Britain.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"Related content", "And families with two kids would be nine times richer, according to shock new figures.",
"From 2014, the 29 million citizens of Romania and Bulgaria become eligible to live anywhere in Europe – and there are fears that millions will be heading to the UK.",
"Migration Watch UK says our minimum wage of £254 a week compares to an average £55 a week in those countries.",
"Chairman Sir Andrew Green said: “Given the incentives, it would be absurd to suggest that there will not be a significant inflow.”",
"US President-elect Donald Trump has reaffirmed plans to deport millions of illegal immigrants from America in a bold statement to the world.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"The 70-year-old billionaire will promise to tackle criminals who were illegally living in America in a broadcast due to be aired later this evening.",
"Appearing in his first tv interview since his shocking election win, Trump said that two to three million immigrants with criminal records in the US would either be jailed or deported.",
"He told CBS show 60 Minutes: \"What we are going to do is get the people that are criminal and have criminal records, gang members, drug dealers. where a lot of these people, probably two million, it could even be three million, we are getting them out of our country, they're here illegally.",
"\"After the border is secure and after everything gets normalised, we're going to make a determination on the people that they're talking about who are terrific people, they're terrific people, but we are gonna (sic) make a determination at that.",
"\"But before we make that determination, it's very important, we are going to secure our border.\"",
"Trump also confirmed plans were underway to construct a \"great wall\" on the US-Mexican border.",
"A spokeswoman for Mr Trump yesterday confirmed that the 70-year-old tycoon had set up a taskforce to begin plans on constructing the wall, which could cost as much as £9.3billion.",
"But the President-elect did concede that parts of the wall may have to be a fence.",
"When asked if he would accept a fence, Trump said: \"For certain areas I would, but certain areas, a wall is more appropriate. I’m very good at this, it’s called construction.\"",
"Congressman Louie Gohmert confirmed yesterday that Trump's wall would is only likely to stretch for “around half” the length of the border, which spans California, Arizona, New Mexico and Texas.",
"Plans to build the wall has seen widespread protests across the US, with demonstrators taking to the streets to protest about their new president.",
"Scores have been arrested and a man was shot in Portland, Oregon, following an argument between activists.",
"In Los Angeles, officers were scouring the route of an earlier protest after an undercover officer lost his gun and handcuffs during a scuffle.",
"THOUSANDS of immigrants are getting access to UK state handouts as soon as they arrive thanks to an EU loophole.",
"Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox!",
"Related content", "In the past five years 100,000 wives, husbands and children of EU citizens have moved to Britain under a lax system that bypasses rules for Brits.",
"British people who want close family from outside Europe to move to the UK have to prove they earn around £18,000 a year before they get visas.",
"But separate rules for EU citizens mean they do not have to bring in the same wages before flying in relatives. They then get the same right to benefits as unemployed Brits.",
"Sir Andrew Green, chairman of Migration Watch, said: “This is a loophole that must be closed.",
"“It is absurd that EU citizens should be in a more favourable position than our own citizens.”"
)), row.names = c(NA, -30L), class = c("tbl_df", "tbl", "data.frame"
))
Thanks
PS.: Sorry for the images, the system did not allow me to add actual tables.
We can use
aggregate(Text ~ Book, df1, FUN = paste, collapse =' ')
-output
Book Text
1 Book1 Text.a Text.b
2 Book2 Text.c Text.d
For the OP's data
aggregate( Content ~ Title, df1, FUN = paste, collapse =' ')
-output
Title
1 Donald Trump pledges to deport 3 MILLION illegal immigrants from the US - Daily Star
2 Immigrants five times better off in Britain - Daily Star
3 Thousands of immigrants get access to state handouts on arrival due to EU loophole - Daily Star
Content
1 US President-elect Donald Trump has reaffirmed plans to deport millions of illegal immigrants from America in a bold statement to the world. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! The 70-year-old billionaire will promise to tackle criminals who were illegally living in America in a broadcast due to be aired later this evening. Appearing in his first tv interview since his shocking election win, Trump said that two to three million immigrants with criminal records in the US would either be jailed or deported. He told CBS show 60 Minutes: "What we are going to do is get the people that are criminal and have criminal records, gang members, drug dealers. where a lot of these people, probably two million, it could even be three million, we are getting them out of our country, they're here illegally. "After the border is secure and after everything gets normalised, we're going to make a determination on the people that they're talking about who are terrific people, they're terrific people, but we are gonna (sic) make a determination at that. "But before we make that determination, it's very important, we are going to secure our border." Trump also confirmed plans were underway to construct a "great wall" on the US-Mexican border. A spokeswoman for Mr Trump yesterday confirmed that the 70-year-old tycoon had set up a taskforce to begin plans on constructing the wall, which could cost as much as £9.3billion. But the President-elect did concede that parts of the wall may have to be a fence. When asked if he would accept a fence, Trump said: "For certain areas I would, but certain areas, a wall is more appropriate. I’m very good at this, it’s called construction." Congressman Louie Gohmert confirmed yesterday that Trump's wall would is only likely to stretch for “around half” the length of the border, which spans California, Arizona, New Mexico and Texas. 
Plans to build the wall has seen widespread protests across the US, with demonstrators taking to the streets to protest about their new president. Scores have been arrested and a man was shot in Portland, Oregon, following an argument between activists. In Los Angeles, officers were scouring the route of an earlier protest after an undercover officer lost his gun and handcuffs during a scuffle.
2 IMMIGRANTS from Romania and Bulgaria would be five times better off if they moved to Britain. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! Related content And families with two kids would be nine times richer, according to shock new figures. From 2014, the 29 million citizens of Romania and Bulgaria become eligible to live anywhere in Europe – and there are fears that millions will be heading to the UK. Migration Watch UK says our minimum wage of £254 a week compares to an average £55 a week in those countries. Chairman Sir Andrew Green said: “Given the incentives, it would be absurd to suggest that there will not be a significant inflow.”
3 THOUSANDS of immigrants are getting access to UK state handouts as soon as they arrive thanks to an EU loophole. Don't miss a thing by getting the Daily Star's biggest headlines straight to your inbox! Related content In the past five years 100,000 wives, husbands and children of EU citizens have moved to Britain under a lax system that bypasses rules for Brits. British people who want close family from outside Europe to move to the UK have to prove they earn around £18,000 a year before they get visas. But separate rules for EU citizens mean they do not have to bring in the same wages before flying in relatives. They then get the same right to benefits as unemployed Brits. Sir Andrew Green, chairman of Migration Watch, said: “This is a loophole that must be closed. “It is absurd that EU citizens should be in a more favourable position than our own citizens.”
Or this can be done in tidyverse
library(dplyr)
library(stringr)
df1 %>%
group_by(Title) %>%
summarise(Content = str_c(Content, collapse=" "), .groups = 'drop')
data
df1 <- structure(list(Book = c("Book1", "Book1", "Book2", "Book2"),
Text = c("Text.a", "Text.b", "Text.c", "Text.d")),
class = "data.frame", row.names = c(NA,
-4L))
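Note that in the output for the OP's data the groups come back sorted alphabetically (the "Donald Trump" title is listed first even though the "Immigrants five times" title appears first in the data). If the original order matters, one possible sketch is to turn the grouping column into a factor whose levels follow first appearance before aggregating. The toy data below is hypothetical and deliberately puts Book2 first to make the effect visible.

```r
# Hypothetical toy data: Book2 appears before Book1
toy <- data.frame(
  Book = c("Book2", "Book2", "Book1", "Book1"),
  Text = c("Text.c", "Text.d", "Text.a", "Text.b"),
  stringsAsFactors = FALSE
)

# Set factor levels in order of first appearance instead of alphabetical order
toy$Book <- factor(toy$Book, levels = unique(toy$Book))

# Same aggregate() call as above; groups now follow the factor level order
res <- aggregate(Text ~ Book, toy, FUN = paste, collapse = " ")
res
```

The grouping column in the result is then a factor; as.character() converts it back if plain strings are needed.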
My dataset looks like the following. I followed the "Classification using Naive Bayes" tutorial to develop my Naive Bayes model for text mining. However, I cannot get predictions from the model, even though it is built: the predict function returns a result with 0 factor levels. Below is my dataset and code so far.
**Dataset:**
lie sentiment review
f n 'Mike\'s Pizza High Point NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.'
f n 'i really like this buffet restaurant in Marshall street. they have a lot of selection of american japanese and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.'
f n 'After I went shopping with some of my friend we went to DODO restaurant for dinner. I found worm in one of the dishes .'
f n 'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
f n 'The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well never more. '
f n 'I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\'t acknowledge the coupon. When I asked her about it she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is otherwise my favorite place to dine. '
f n 'I went to ABC restaurant two days ago and I hated the food and the service. We were kept waiting for over an hour just to get seated and once we ordered our food came out cold. I ordered the pasta and it was terrible - completely bland and very unappatizing. I definitely would not recommend going there especially if you\'re in a hurry!'
f n 'I went to the Chilis on Erie Blvd and had the worst meal of my life. We arrived and waited 5 minutes for a hostess and then were seated by a waiter who was obviously in a terrible mood. We order drinks and it took them 15 minutes to bring us both the wrong beers which were barely cold. Then we order an appetizer and wait 25 minutes for cold southwest egg rolls at which point we just paid and left. Don\'t go.'
f n 'OMG. This restaurant is horrible. The receptionist did not greet us we just stood there and waited for five minutes. The food came late and served not warm. Me and my pet ordered a bowl of salad and a cheese pizza. The salad was not fresh the crust of a pizza was so hard like plastics. My dog didn\'t even eat that pizza. I hate this place!!!!!!!!!!'
dput(df)
> dput(head(lie))
structure(list(lie = c("f", "f", "f", "f", "f", "f"), sentiment = c("n",
"n", "n", "n", "n", "n"), review = c("Mike\\'s Pizza High Point, NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza, not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.",
"i really like this buffet restaurant in Marshall street. they have a lot of selection of american, japanese, and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.",
"After I went shopping with some of my friend, we went to DODO restaurant for dinner. I found worm in one of the dishes .",
"Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\\'t go to the Olive Oil Garden. ",
"The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well, never more. ",
"I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\\'t acknowledge the coupon. When I asked her about it, she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall, I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is, otherwise, my favorite place to dine. "
)), .Names = c("lie", "sentiment", "review"), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000180788>)
R code:
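library(data.table) # fread()
library(tm)         # Corpus(), tm_map(), DocumentTermMatrix()
library(e1071)      # naiveBayes()
library(gmodels)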
lie<- fread('deception.csv',header = T,fill = T,quote = "\'")
str(lie)
lie
#Corpus Building
words.vec<- VectorSource(lie$review)
words.corpus<- Corpus(words.vec)
words.corpus<-tm_map(words.corpus,content_transformer(tolower)) #lower case
words.corpus<-tm_map(words.corpus,removePunctuation) # remove punctuation
words.corpus<-tm_map(words.corpus,removeNumbers) # remove numbers
words.corpus<-tm_map(words.corpus,removeWords,stopwords('english')) # remove stopwords
words.corpus<-tm_map(words.corpus,stripWhitespace) # remove unnecessary whitespace
#==========================================================================
#Document term Matrix
dtm<-DocumentTermMatrix(words.corpus)
dtm
class(dtm)
#dtm_df<-as.data.frame(as.matrix(dtm))
#class(dtm_df)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq,decreasing=TRUE)
freq[head(ord)]
freq[tail(ord)]
#===========================================================================
#Data frame partition
#Splitting DTM
dtm_train <- dtm[1:61, ]
dtm_test <- dtm[62:92, ]
train_labels <- lie[1:61, ]$lie
test_labels <-lie[62:92, ]$lie
str(train_labels)
str(test_labels)
prop.table(table(train_labels))
prop.table(table(test_labels))
freq_words <- findFreqTerms(dtm_train, 10)
freq_words
dtm_freq_train<- dtm_train[ , freq_words]
dtm_freq_test <- dtm_test[ , freq_words]
dtm_freq_test
convert_counts <- function(x) {
x <- ifelse(x > 0, 'yes','No')
}
train <- apply(dtm_freq_train, MARGIN = 2, convert_counts)
test <- apply(dtm_freq_test, MARGIN = 2, convert_counts)
str(test)
nb_classifier<-naiveBayes(train,train_labels)
nb_classifier
test_pred<-predict(nb_classifier,test)
Thanks in advance for help.
Naive Bayes requires the response variable to be a categorical class variable.
Convert the lie column of your lie data frame to a factor and re-run the analysis:
lie$lie <- as.factor(lie$lie)
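A minimal sketch of where that conversion would sit, assuming a toy stand-in for the lie data (the values below are invented for illustration). The point is only that the conversion should happen before train_labels and test_labels are extracted, so that both carry the same factor levels. As far as I know, naiveBayes() from e1071 takes its class levels from the factor, which is why character labels cause trouble at prediction time.

```r
# Toy stand-in for the `lie` data; values are made up for illustration.
lie <- data.frame(
  lie    = c("f", "t", "f", "t", "f", "t"),
  review = paste("review", 1:6),
  stringsAsFactors = FALSE
)

# Convert the label column first ...
lie$lie <- as.factor(lie$lie)

# ... then take the train/test labels, so both inherit the same levels.
train_labels <- lie$lie[1:4]
test_labels  <- lie$lie[5:6]

is.factor(train_labels)
levels(test_labels)
```

In the original script the same ordering would mean placing the as.factor() line before the "Data frame partition" step.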
I have a corpus of news articles on a given topic. Some of these articles are the exact same article but have been given additional headers and footers that very slightly change the content. I am trying to delete all but one of the potential duplicates so the final corpus only contains unique articles.
I decided to use cosine similarity to identify the potential duplicates:
myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim)
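For readers who want to see what the similarity matrix contains, here is a small base R sketch of cosine similarity on a made-up count matrix. The names m and cos_toy and the counts are assumptions for illustration only; this is not how quanteda computes it internally, just the same formula written out.

```r
# Tiny document-term count matrix (made-up counts, rows = documents).
m <- rbind(doc1 = c(2, 1, 0, 1),
           doc2 = c(2, 1, 0, 1),
           doc3 = c(0, 0, 3, 1))

# Cosine similarity: dot product divided by the product of the vector lengths.
norms   <- sqrt(rowSums(m^2))
cos_toy <- (m %*% t(m)) / (norms %o% norms)

round(cos_toy, 2)
```

Identical rows (doc1 and doc2) score 1, and every document scores 1 against itself, which is why the diagonal has to be handled separately before thresholding.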
After looking at a subset of the data, I chose a cutoff of .9 cosine similarity or above to indicate duplicates. (I am okay with any error that this introduces.) Given this, I have converted the diagonal to 0 (i.e., not a dup) and altered the matrix to indicate which documents are duplicates and which are not:
diag(cosinemat) <- 0
cosinemat[cosinemat >= .9] <- 1
cosinemat[cosinemat < .9] <- 0
The problem I'm running into is figuring out how to delete all but one of the duplicate documents. Initially, I envisioned a for loop that goes through each column cell by cell and, for any cell that has a value of 1 (i.e., is a duplicate), deletes the column with the same name as the row of the current cell, reconstitutes the matrix, and continues on to the next cell. The for loop doesn't seem to like the line of code that deletes the columns with the name of the current row when the cell is equal to 1. I'm also not sure it's okay to reconstitute the object you're looping through. Something like this:
cosine_df <- as.data.frame(cosinemat)
for(col in 1:ncol(cosine_df)){
for(row in 1:nrow(cosine_df)){
if(cosine_df[col,row] == 0){
next
}
if(cosine_df[col,row] == 1){
cosine_df <- cosine_df[!rownames(cosine_df) %in% paste(rownames(cosine_df)[col,row]]
}
}
}
I'm not set on this approach, and I'm open to creative solutions, so long as I am able to identify similar documents and to delete all but one document.
Here's a subset of the documents if it helps:
docs <- structure(list(text_main = c("Congressional Documents and PublicationsMay 26, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:287 wordsBody(Washington, DC) Reps. Ted Deutch (D-FL) and Gus Bilirakis (R-FL) joined with Reps. Steve Israel (D-NY), Mike Kelly (R-PA), Ted Lieu (D-CA), Adam Kinzinger (R-IL), Hakeem Jeffries (D-NY), Lee Zeldin (R-NY), and Susan Davis (D-CA) to introduce a resolution (H. Res. 750) urging the European Union (EU) to designate the entirety of Hizballah as a terrorist organization and increase pressure on the organizations and its members. Currently, the EU only designates Hizballah's military wing as a terrorist organization, while the United States makes no distinction between its military and political branches when listing the group on its Foreign Terrorist Organization list.Upon introduction, the Members of Congress released the following statement:\"Hizballah is an Iranian-backed terrorist organization with a global reach that engages in significant illicit criminal activity to fund its terrorism. It doesn't matter what part of the organization you're associated with; if you are connected with Hizballah, you are contributing to the rocket attacks on innocent Israeli civilians, targeted bombings of Jews around the world, slaughter of civilians in Syria, and destabilization of the Middle East. There is no distinction between parts of Hizballah when every part contributes to terrorism. We urge our EU allies to help rein in Hizballah's dangerous worldwide activities.\"The resolution can be viewed here .Last year, Congress passed the Hizballah International Financing Prevention Act which tightened sanctions on Hizballah's criminal and financial networks.Read this original document at: ",
"Congressional Documents and PublicationsApril 20, 2016Copyright 2016 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:499 wordsBodyToday, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:\"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses.\"In 2004, the U.S. Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. 
The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here .Read this original document at: ",
"Targeted News ServiceApril 20, 2016 Wednesday 7:41 AM ESTCopyright 2016 Targeted News Service LLC All Rights ReservedLength:511 wordsByline:Targeted News ServiceDateline:WASHINGTON BodyRep. Ted Deutch, D-Fla. (21st CD), issued the following news release:Today, members of the House of Representatives Bipartisan Taskforce for Combating Anti-Semitism sounded the alarm about a troubling surge in anti-Semitism on American college campuses. In a letter to the Secretary of Education, the Taskforce asked the Secretary about the Department's planned response to the issue. Additionally, the co-chairs made the following statement:\"An alarming rise of anti-Israel programs on American college campuses contribute to increasing harassment, intimidation, and discrimination against Jewish students. While we believe that students' freedoms of speech and assembly should be respected, there are increasing reports that activity advertised as anti-Israel or anti-Zionist is devolving into displays of subtle, but sometimes outright anti-Semitism. Attacks on students because of their actual or perceived religion, ancestry, or ethnicity are unacceptable. We believe strongly that no student should ever face discrimination and that school activities must be structured in a respectful manner to ensure academic integrity and a nondiscriminatory environment throughout the entire campus. For these reasons, we ask the Department of Education to assess its ability to monitor and respond to anti-Semitic incidents and to take additional steps to combat intimidation and harassment against minority students on college campuses.\"In 2004, the U.S. 
Department of Education Office for Civil Rights (OCR) clarified its interpretation of Title VI of the Civil Rights Act of 1964, including protections for groups of students on the basis of their actual or perceived shared ancestry or ethnic characteristics, regardless of whether they are members of a faith community, as in the case for Jewish, Sikh, and Muslim students. The Department reiterated this policy again in 2010 and 2015.However, as the number of reported Boycott, Divestment, and Sanctions (BDS) movement campaigns and other anti-Israel initiatives rise on college campuses, Members of Congress believe the Department must proactively implement its anti-discrimination policy to mitigate anti-Semitism on college campuses.The Bipartisan Taskforce for Combating Anti-Semitism is co-chaired by U.S. Reps. Nita Lowey (D-NY), Chris Smith (R-NJ), Eliot Engel (D-NY), Ileana Ros-Lehtinen (R-FL), Kay Granger (R-TX), Steve Israel (D-NY), Peter Roskam (R-IL), and Ted Deutch (D-FL).The following organizations expressed their support for the letter: the Anti-Defamation League, Jewish Federation of North America, B'nai Brith International, Jewish United Fund/Jewish Federation of Metropolitan Chicago, the Louis D. Brandeis Center for Human Rights Under Law, the World Jewish Congress, and the Zionist Organization of America.Text of the letter can be found here ().Contact: Jason Attermann, 202/225-3001Copyright Targeted News Services30FurigayJof-5501453 30FurigayJof",
"US Official NewsFebruary 13, 2013 WednesdayCopyright 2013 Plus Media Solutions Private Limited All Rights ReservedLength:298 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release: Rep. Ted Deutch (D-FL) and Rep. Gus Bilirakis (R-GL) issued the following statements regarding the Bulgarian governments report that two individuals responsible for the July 2012 terrorist attack on a bus in Burgas, Bulgaria, have ties to Hezbollah. Five Israeli tourists and the Bulgarian bus driver were killed in the attack.Congressman Bilirakis: The Bulgarian governments report is yet another example of Hezbollah's deliberate use of terror across the globe. Contrary to some European opinions, Hezbollah is not merely a political organization and is actively involved in terrorist activities. As I have requested many times, the European Union must finally recognize Hezbollah for what it is: a terrorist organization. I commend the Bulgarian government for their thorough investigation and call on the members of the European Union to examine these findings closely.Congressman Deutch: The results of the Bulgarian governments investigation into the deadly attack in Burgas confirms what we already knew - Hezbollah is a terrorist organization that is willing to perpetrate attacks on innocent civilians around the globe. I continue to urge our European partners to formally designate Hezbollah as a terrorist organization. Failure to do so only emboldens Hezbollah to continue its reign of terror in Europe and around the world.In September 2012, Congressmen Bilirakis and Deutch initiated a bi-partisan letter signed by 268 Members of Congress to the President and Ministers of the Commission of the European Union, urging them to include Hezbollah on the European Union's list of terrorist organizations. For further information please visit: ",
"Congressional Documents and PublicationsMay 4, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:204 wordsBodyWashington, May 4 -Rep. Ted Deutch released the following statement on the Florida legislature's passage of SB 444, which expands upon the Protecting Florida's Investments Act, legislation he authored in 2007 in the Florida State Senate:\"I applaud the Florida Legislature's passage of SB 444, legislation that will help ensure national and international security by preventing Florida's taxpayer dollars from supporting companies who choose to violate federal law by bolstering the Iranian regime. I congratulate the bill's sponsors, Sen. Ellyn Bogdanoff and Rep. Mack Bernard. This bill prevents state and local governments from awarding contracts to companies found to be investing in the Iranian energy sector. It is consistent with federal policy and sends a clear message that Floridians will not support any company that puts profit over international security. The Iranian regime continues to pursue its illicit nuclear weapons program, continues to engage in the most egregious human rights violations, and continues to support terrorism across the globe. We must continue to utilize every economic tool at our disposal to bring this regime to its knees. I urge Governor Scott to act quickly to sign this bill into law.\"",
"Congressional Documents and PublicationsMarch 23, 2011Copyright 2011 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:128 wordsBodyBoca Raton, Mar 23 -Congressman Ted Deutch (D-FL) released the following statement in reaction to the explosion of a bomb today in Jerusalem that killed a 59-year-old woman and injured dozens more:\"Today's horrific bombing in Jerusalem is yet another attack in a surge of violence perpetuated by Palestinian terrorists against innocent Israeli citizens,\" said Congressman Ted Deutch. \"The victims of this heinous attack and the Israeli people deserve the full support of the international community as they seek to defend themselves against this relentless violence. It is deplorable that as Israelis endure this latest bombing in Jerusalem, as well as ongoing rocket attacks by Hamas, some astonishingly still seek to blame Israel for the lack of peace in the region.\"",
"States News ServiceMarch 26, 2015 ThursdayCopyright 2015 States News ServiceLength:218 wordsByline:States News ServiceDateline:WASHINGTON BodyThe following information was released by the office of Florida Rep. Ted Deutch:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\"",
"Congressional Documents and PublicationsMarch 26, 2015Copyright 2015 Federal Information and News Dispatch, Inc.Section:U.S. HOUSE OF REPRESENTATIVES DOCUMENTSLength:250 wordsBodyCongressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums. In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Florida's 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Today's deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\"For a fact sheet on H.R. 2, please go to: .Read this original document at: ",
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\" In case of any query regarding this article or other content needs please contact: ",
"US Official NewsMarch 27, 2015 FridayCopyright 2015 Plus Media Solutions Private Limited All Rights ReservedLength:241 wordsDateline:Washington Body Office of the House of Representative Ted Deutch, U.S Government has issued the following news release:Congressman Ted Deutch (FL-21) issued the following statement after voting in favor of the bipartisan 'doc fix' deal included in H.R. 2, the Medicare Access and CHIP Reauthorization Act, which passed the House of Representatives by a vote of 392-27:\"Today I was pleased to vote for a bipartisan deal to permanently repeal the misguided Sustainable Growth Rate (SGR) policy that jeopardizes doctors participating in Medicare, and to permanently extend assistance to low-income seniors struggling to pay their Medicare premiums.In addition to building certainty in our Medicare system, this legislation also protects health care for low-income kids and funding for the federal health centers in Floridas 21st district that in 2013 alone served over 70,000 patients living below or near the federal poverty level.\"Todays deal is the product of the kind of bipartisan compromise that has become all too rare in Washington, in which neither Democrats nor Republicans get everything they want but come together to move our country forward. I can only imagine how much more we could achieve for the American people if we came together in the spirit of compromise on more issues, from infrastructure investment to immigration reform.\" In case of any query regarding this article or other content needs please contact: "
)), row.names = c(NA, 10L), class = "data.frame", .Names = "text_main")
Here is the matrix of similarity for the same subset of documents:
cosine_df <- structure(list(text1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text2 = c(0,
0, 1, 0, 0, 0, 0, 0, 0, 0), text3 = c(0, 1, 0, 0, 0, 0, 0, 0,
0, 0), text4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), text5 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), text6 = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0), text7 = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1), text8 = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 1), text9 = c(0, 0, 0, 0, 0, 0, 1, 1,
0, 1), text10 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 0)), .Names = c("text1",
"text2", "text3", "text4", "text5", "text6", "text7", "text8",
"text9", "text10"), row.names = c("text1", "text2", "text3",
"text4", "text5", "text6", "text7", "text8", "text9", "text10"
), class = "data.frame")
In case anyone else has a similar problem, this was the solution I ended up creating:
library(quanteda)
myDfm <- dfm(as.character(docs$text_main), verbose=FALSE)
cosinesim <- textstat_simil(x=myDfm, selection=docnames(myDfm), margin="documents", method="cosine")
cosinemat <- as.matrix(cosinesim) #this produces a matrix of the document similarities
threshold <- .9
similar_indices <- unique(apply(cosinemat, 1,
function(x) which(x > threshold)))
## keep only the first element of each set
## is.list()/is.matrix() rather than class() ==, since class() of a matrix has length 2 in R >= 4.0
if (is.list(similar_indices)) { # sets of different sizes come back as a list
unique_indices <- unique(sapply(similar_indices, function(x) as.numeric(x[1])))
} else if (is.matrix(similar_indices)) { # equal-sized sets come back as a matrix
unique_indices <- unique(apply(similar_indices, 2, function(x) as.numeric(x[1])))
} else {
unique_indices <- similar_indices
}
## get only the unique texts
docs_unique <- docs[unique_indices ,]
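If you already have the 0/1 matrix from the question, a greedy pass over the rows is another option. This is only a sketch under the assumption that keeping the first document of each duplicate group is acceptable; dup and keep are names I made up, and the matrix below is rebuilt by hand to match the cosine_df posted in the question.

```r
# Rebuild the 10 x 10 duplicate-indicator matrix posted in the question.
nm  <- paste0("text", 1:10)
dup <- matrix(0, 10, 10, dimnames = list(nm, nm))
dup[2, 3] <- dup[3, 2] <- 1   # text2 and text3 are duplicates
dup[7:10, 7:10] <- 1          # text7..text10 are all duplicates of each other
diag(dup) <- 0                # a document is not its own duplicate

keep <- character(0)
for (doc in rownames(dup)) {
  # Keep a document only if it is not a duplicate of one already kept.
  if (!any(dup[doc, keep] == 1)) keep <- c(keep, doc)
}
keep
```

Because nothing is deleted while looping, there is no need to reconstitute the matrix mid-loop. The kept names could then be mapped back to row positions of docs, for example with match(keep, rownames(dup)).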
I have a corpus of newspaper articles of which only specific parts are of interest for my research. I'm not happy with the results I get from classifying texts along different frames because the data contains too much noise. I therefore want to extract only the relevant parts from the documents. I was thinking of doing so by transforming several kwic objects generated by the quanteda package into a single df.
So far I've tried the following:
exampletext <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
mycorpus <- corpus(exampletext)
mycorpus.nat <- corpus(kwic(mycorpus, "nationalit*", window = 5, valuetype = "glob"))
mycorpus.cit <- corpus(kwic(mycorpus, "citizenship", window = 5, valuetype = "glob"))
mycorpus.kwic <- mycorpus.nat + mycorpus.cit
mydfm <- dfm(mycorpus.kwic)
This, however, generates a dfm that contains 4 documents instead of 2, and when both keywords are present in a document even more. I can't think of a way to bring the dfm down to the original number of documents.
Thank you for helping me out.
We recently added a window argument to tokens_select() for this purpose:
require(quanteda)
txt <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
toks <- tokens(txt)
mt_nat <- dfm(tokens_select(toks, "nationalit*", window = 5))
mt_cit <- dfm(tokens_select(toks, "citizenship*", window = 5))
Please make sure that you are using the latest version of quanteda.
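To illustrate what a keyword window does, and why selecting on tokens keeps one result per original document, here is a rough base R sketch. keep_window is a hypothetical helper written for this example, not a quanteda function, and the two short texts are excerpts from the example above. In quanteda itself, I believe passing both patterns as one character vector to tokens_select() would give a single object with the original number of documents, but treat that as an assumption and check the documentation.

```r
# Hypothetical helper: keep tokens within `window` positions of any match.
keep_window <- function(tokens, pattern, window = 5) {
  hits <- grep(pattern, tokens, ignore.case = TRUE)
  if (length(hits) == 0) return(character(0))
  idx <- unlist(lapply(hits, function(h) (h - window):(h + window)))
  idx <- sort(unique(idx[idx >= 1 & idx <= length(tokens)]))
  tokens[idx]
}

txt <- c(d1 = "athletes who change their nationality must wait three years before competing",
         d2 = "moved to Britain in March and was granted British citizenship")

# Crude tokenisation on non-letters; a real tokenizer would do better.
toks <- strsplit(tolower(txt), "[^a-z]+")

# One pattern covers both keywords, so each document yields one token set.
kept <- lapply(toks, keep_window,
               pattern = "^(nationalit.*|citizenship)$", window = 2)
kept
```

The result has two elements, one per input document, which is the property the kwic-based approach loses.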