Convert results from a function into a dataframe - R

From these results:
library(stm)
labelTopics(gadarianFit, n = 15)
Topic 1 Top Words:
Highest Prob: immigr, illeg, legal, border, will, need, worri, work, countri, mexico, life, better, nation, make, worker
FREX: border, mexico, mexican, need, concern, fine, make, better, worri, nation, deport, worker, will, econom, poor
Lift: cross, racism, happen, other, continu, concern, deport, mexican, build, fine, econom, border, often, societi, amount
Score: immigr, border, need, will, mexico, illeg, mexican, worri, concern, legal, nation, fine, worker, better, also
Topic 2 Top Words:
Highest Prob: job, illeg, tax, pay, american, take, care, welfar, crime, system, secur, social, health, cost, servic
FREX: cost, health, servic, welfar, increas, loss, school, healthcar, job, care, medic, crime, social, violenc, educ
Lift: violenc, expens, opportun, cost, healthcar, loss, increas, gang, servic, medic, health, diseas, terror, school, lose
Score: job, welfar, crime, cost, tax, care, servic, increas, health, pay, school, loss, medic, healthcar, social
Topic 3 Top Words:
Highest Prob: peopl, come, countri, think, get, english, mani, live, citizen, learn, way, becom, speak, work, money
FREX: english, get, come, mani, back, becom, like, think, new, send, right, way, just, live, peopl
Lift: anyth, send, still, just, receiv, deserv, back, new, english, mani, get, busi, year, equal, come
Score: think, peopl, come, get, english, countri, mani, speak, way, send, back, money, becom, learn, live
How is it possible to keep the results from Highest Prob in a dataframe with the number of columns equal to the number of topics and the number of rows equal to the number of words per topic (n = 15)?
Example of expected output:
topic1 topic2 topic3
immigr job peopl
illeg illeg come

In the labelTopics object, words are stored under prob. So you could try something like this:
library(stm)
topics <- labelTopics(gadarianFit, n=15)
topics <- data.frame(t(topics$prob))
colnames(topics) <- paste0("topic", 1:ncol(topics))
topics
#> topic1 topic2 topic3
#> 1 immigr job peopl
#> 2 illeg illeg come
#> 3 legal tax countri
#> 4 border pay think
#> 5 will american get
#> 6 need take english
#> 7 worri care mani
#> 8 work welfar live
#> 9 countri crime citizen
#> 10 mexico system learn
#> 11 life secur way
#> 12 better social becom
#> 13 nation health speak
#> 14 make cost work
#> 15 worker servic money
Note that stm offers several ways of selecting the most important words per topic, including FREX and Lift. You would simply have to replace prob in my code with the corresponding element (frex, lift, or score) to use those.
Type this to see them:
topics <- labelTopics(gadarianFit, n=15)
str(topics)
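For example, this sketch (untested, following the same pattern as above) builds the equivalent data frame from the FREX words; the name frex_df is arbitrary:
topics <- labelTopics(gadarianFit, n = 15)
frex_df <- data.frame(t(topics$frex))
colnames(frex_df) <- paste0("topic", 1:ncol(frex_df))
frex_df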

Related

R: How to Prepare Data for LDA/Text Analysis

I am working with the R programming language.
I would like to perform BTM (Biterm Topic Model - a variant of LDA (Latent Dirichlet Allocation) for small text datasets) on some text data. I am following this tutorial: https://cran.r-project.org/web/packages/BTM/readme/README.html
When I look at the dataset ("brussels_reviews_anno") being used in this tutorial, it looks something like this (I cannot recognize the format of this data!):
library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")
head(brussels_reviews_anno)
doc_id language sentence_id token_id token lemma upos xpos
1 32198807 es 1 1 Gwen gwen NOUN NNP
2 32198807 es 1 2 fue ser VERB VB
3 32198807 es 1 3 una un DET DT
4 32198807 es 1 4 magnifica magnifica NOUN NN
5 32198807 es 1 5 anfitriona anfitriono ADJ JJ
6 32198807 es 1 6 . . PUNCT .
My dataset ("my_data") is in the current format - I manually create a text dataset for this example using reviews of fast food restaurants found on the internet:
my_data = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
Can someone please show me how I can take my dataset and transform it in such a way that I can perform BTM analysis on this data and create a visualization similar to the visualizations in this tutorial?
Thanks!
Additional References:
https://rforanalytics.com/11-7-topic-modelling.html
The class of brussels_reviews_anno is just a regular data.frame. That structure is generated by the function udpipe() from the package udpipe.
Below I provide a working example that shows how to replicate a similar data structure; you will only need to adjust the path where the language model is saved.
Please keep in mind that udpipe() does a lot of work. The reason you see many more columns in the final data.frame out is that I did not tweak any parameters of the function or delete any of the columns.
Overall, to get started with BTM() you need to tokenize your textual data. That's one of the things you can do with the package udpipe.
Hope this helped!
library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")
head(brussels_reviews_anno)
#> doc_id language sentence_id token_id token lemma upos xpos
#> 1 32198807 es 1 1 Gwen gwen NOUN NNP
#> 2 32198807 es 1 2 fue ser VERB VB
#> 3 32198807 es 1 3 una un DET DT
#> 4 32198807 es 1 4 magnifica magnifica NOUN NN
#> 5 32198807 es 1 5 anfitriona anfitriono ADJ JJ
#> 6 32198807 es 1 6 . . PUNCT .
my_data = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
# download a language model
udpipe_download_model("english-ewt", model_dir = "~/Desktop/")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to ~/Desktop//english-ewt-ud-2.5-191206.udpipe
#> - This model has been trained on version 2.5 of data from https://universaldependencies.org
#> - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#> - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#> - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '~/Desktop//english-ewt-ud-2.5-191206.udpipe'
#> language file_model
#> 1 english-ewt ~/Desktop//english-ewt-ud-2.5-191206.udpipe
#> url
#> 1 https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe
#> download_failed download_message
#> 1 FALSE OK
# load in the environment
eng_model = udpipe_load_model("~/Desktop/english-ewt-ud-2.5-191206.udpipe")
# apply the tokenization
out = udpipe(my_data$reviews, object = eng_model)
head(out)
#> doc_id paragraph_id sentence_id
#> 1 doc1 1 1
#> 2 doc1 1 1
#> 3 doc1 1 1
#> 4 doc1 1 1
#> 5 doc1 1 1
#> 6 doc1 1 1
#> sentence
#> 1 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 2 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 3 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 4 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 5 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 6 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> start end term_id token_id token lemma upos xpos
#> 1 1 1 1 1 I I PRON PRP
#> 2 3 7 2 2 guess guess VERB VBP
#> 3 9 11 3 3 the the DET DT
#> 4 13 20 4 4 employee employee NOUN NN
#> 5 22 28 5 5 decided decide VERB VBD
#> 6 30 31 6 6 to to PART TO
#> feats head_token_id dep_rel deps misc
#> 1 Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj <NA> <NA>
#> 2 Mood=Ind|Tense=Pres|VerbForm=Fin 0 root <NA> <NA>
#> 3 Definite=Def|PronType=Art 4 det <NA> <NA>
#> 4 Number=Sing 5 nsubj <NA> <NA>
#> 5 Mood=Ind|Tense=Past|VerbForm=Fin 2 ccomp <NA> <NA>
#> 6 <NA> 7 mark <NA> <NA>
Created on 2022-09-20 by the reprex package (v2.0.1)
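From here, a rough sketch of how the tokenized output could be passed to BTM(), following the structure of the BTM README; the part-of-speech filter, k = 3, and the plotting step are assumptions you will likely want to adjust for your data:
library(BTM)
# keep nouns and adjectives only, and pass doc_id plus lemma to BTM
traindata <- out[out$upos %in% c("NOUN", "ADJ"), c("doc_id", "lemma")]
set.seed(123)
model <- BTM(traindata, k = 3, beta = 0.01, iter = 1000, trace = 100)
# inspect the top terms per topic
terms(model)
# the README visualizes the fitted model with the textplot package:
# library(textplot); library(ggraph); plot(model, top_n = 5)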

Iterate over columns to count words in a sentence and put the count in a new column

I have some columns titled essay0-essay9. I want to iterate over them, count the words, and then make a new column with the number of words, so essay0 will get a column essay0_num with 5 if that is how many words it has in it.
So far I have cupid <- cupid %>% mutate(essay9_num = sapply(strsplit(essay9, " "), length))
to count the words and add a column, but I don't want to do it one by one for all 10.
I tried a for loop:
for (i in 0:31) {
cupid <- cupid %>% mutate(xxx_num = sapply(strsplit(xxx, " "), length))
}
but I am not sure how to iterate over the columns in a for loop in R. I thought maybe I could pull out the columns I need, put them into a new df, and use sapply somehow that way, but I still run into the problem of iterating over the columns.
dput:
dput(head(cupid))
structure(list(age = c(22L, 35L, 38L, 23L, 29L, 29L), status = c("single",
"single", "available", "single", "single", "single"), sex = c("m",
"m", "m", "m", "m", "m"), orientation = c("straight", "straight",
"straight", "straight", "straight", "straight"), body_type = c("a little extra",
"average", "thin", "thin", "athletic", "average"), diet = c("strictly anything",
"mostly other", "anything", "vegetarian", "", "mostly anything"
), drinks = c("socially", "often", "socially", "socially", "socially",
"socially"), drugs = c("never", "sometimes", "", "", "never",
""), education = c("working on college/university", "working on space camp",
"graduated from masters program", "working on college/university",
"graduated from college/university", "graduated from college/university"
), ethnicity = c("asian, white", "white", "", "white", "asian, black, other",
"white"), height = c(75, 70, 68, 71, 66, 67), income = c(-1L,
80000L, -1L, 20000L, -1L, -1L), job = c("transportation", "hospitality / travel",
"", "student", "artistic / musical / writer", "computer / hardware / software"
), last_online = c("2012-06-28-20-30", "2012-06-29-21-41", "2012-06-27-09-10",
"2012-06-28-14-22", "2012-06-27-21-26", "2012-06-29-19-18"),
location = c("south san francisco, california", "oakland, california",
"san francisco, california", "berkeley, california", "san francisco, california",
"san francisco, california"), offspring = c("doesn't have kids, but might want them",
"doesn't have kids, but might want them", "", "doesn't want kids",
"", "doesn't have kids, but might want them"), pets = c("likes dogs and likes cats",
"likes dogs and likes cats", "has cats", "likes cats", "likes dogs and likes cats",
"likes cats"), religion = c("agnosticism and very serious about it",
"agnosticism but not too serious about it", "", "", "", "atheism"
), sign = c("gemini", "cancer", "pisces but it doesn’t matter",
"pisces", "aquarius", "taurus"), smokes = c("sometimes",
"no", "no", "no", "no", "no"), speaks = c("english", "english (fluently), spanish (poorly), french (poorly)",
"english, french, c++", "english, german (poorly)", "english",
"english (fluently), chinese (okay)"), essay0 = c("about me: i would love to think that i was some some kind of intellectual: either the dumbest smart guy, or the smartest dumb guy. can't say i can tell the difference. i love to talk about ideas and concepts. i forge odd metaphors instead of reciting cliches. like the simularities between a friend of mine's house and an underwater salt mine. my favorite word is salt by the way (weird choice i know). to me most things in life are better as metaphors. i seek to
make myself a little better everyday, in some productively lazy way. got tired of tying my shoes. considered hiring a five year old, but would probably have to tie both of our shoes... decided to only wear leather shoes dress shoes. about you: you love to have really serious, really deep conversations about really silly stuff. you have to be willing to snap me out of a light hearted rant with a kiss. you don't have to be funny, but you have to be able to make me laugh. you should be able to bend spoons with your
mind, and telepathically make me smile while i am still at work. you should love life, and be cool with just letting the wind blow. extra points for reading all this and guessing my favorite video game (no hints given yet). and lastly you have a good attention span.",
"i am a chef: this is what that means. 1. i am a workaholic. 2. i love to cook regardless of whether i am at work. 3. i love to drink and eat foods that are probably really bad for me. 4. i love being around people that resemble line 1-3. i love the outdoors and i am an avid skier. if its snowing i will be in tahoe at the very least. i am a very confident and friendly. i'm not interested in acting or being a typical guy. i have no time or patience for rediculous acts of territorial pissing. overall i am a very
likable easygoing individual. i am very adventurous and always looking forward to doing new things and hopefully sharing it with the right person.",
"i'm not ashamed of much, but writing public text on an online dating site makes me pleasantly uncomfortable. i'll try to be as earnest as possible in the noble endeavor of standing naked before the world. i've lived in san francisco for 15 years, and both love it and find myself frustrated with its deficits. lots of great friends and acquaintances (which increases my apprehension to put anything on this site), but i'm feeling like meeting some new people that aren't just friends of friends. it's okay if you are a friend of a friend too. chances are, if you make it through the complex filtering process of multiple choice questions, lifestyle statistics, photo scanning, and these indulgent blurbs of text without moving quickly on to another search result, you are probably already a cultural peer and at most 2 people removed. at first, i thought i should say as little as possible here to avoid
you, but that seems silly. as far as culture goes, i'm definitely more on the weird side of the spectrum, but i don't exactly wear it on my sleeve. once you get me talking, it will probably become increasingly apparent that while i'd like to think of myself as just like everybody else (and by some definition i certainly am), most people don't see me that way. that's fine with me. most of the people i find myself gravitating towards are pretty weird themselves. you probably are too.",
"i work in a library and go to school. . .", "hey how's it going? currently vague on the profile i know, more to come soon. looking to meet new folks outside of my circle of friends. i'm pretty responsive on the reply tip, feel free to drop a line. cheers.",
"i'm an australian living in san francisco, but don't hold that against me. i spend most of my days trying to build cool stuff for my company. i speak mandarin and have been known to bust out chinese songs at karaoke. i'm pretty cheeky. someone asked me if that meant something about my arse, which i find really funny. i'm a little oddball. i have a wild imagination; i like to think
of the most improbable reasons people are doing things just for fun. i love to laugh and look for reasons to do so. occasionally this gets me in trouble because people think i'm laughing at them. sometimes i am, but more often i'm only laughing at myself. i'm an entrepreneur (like everyone else in sf, it seems) and i love what i do. i enjoy parties and downtime in equal measure. intelligence really turns me on and i love people who can teach me new things."
), essay1 = c("currently working as an international agent for a freight forwarding company. import, export, domestic you know the works. online classes and trying to better myself in my free time. perhaps a hours worth of a good book or a video game on a
lazy sunday.",
"dedicating everyday to being an unbelievable badass.", "i make nerdy software for musicians, artists, and experimenters to indulge in their own weirdness, but i like to spend time away from the computer when working on my artwork (which is typically more
concerned with group dynamics and communication, than with visual form, objects, or technology). i also record and deejay dance, noise, pop, and experimental music (most of which electronic or at least studio based). besides these relatively ego driven activities, i've been enjoying things like meditation and tai chi to try and gently flirt with ego death.",
"reading things written by old dead people", "work work work work + play",
"building awesome stuff. figuring out what's important. having adventures. looking for treasure."
), essay2 = c("making people laugh. ranting about a good salting. finding simplicity in complexity, and complexity in simplicity.",
"being silly. having ridiculous amonts of fun wherever. being a smart ass. ohh and i can cook. ;)",
"improvising in different contexts. alternating between being present and decidedly outside of a moment, or trying to hold both at once. rambling intellectual conversations that hold said conversations in contempt while seeking to find something that transcends them. being critical while remaining generous. listening to and using body language--often performed in caricature or large
gestures, if not outright interpretive dance. dry, dark, and raunchy humor.",
"playing synthesizers and organizing books according to the library of congress classification system",
"creating imagery to look at: http://bagsbrown.blogspot.com/ http://stayruly.blogspot.com/",
"imagining random shit. laughing at aforementioned random shit. being goofy. articulating what i think and feel. convincing people i'm right. admitting when i'm wrong. i'm also pretty good at helping people think through problems; my friends say i give good advice. and when i don't have a clue how to help, i will say: i give pretty good hug."
), essay3 = c("the way i look. i am a six foot half asian, half caucasian mutt. it makes it tough not to notice me, and for me to blend in.",
"", "my large jaw and large glasses are the physical things people comment on the most. when sufficiently stimulated, i have an unmistakable cackle of a laugh. after that, it goes in more directions than i care to describe right now. maybe i'll come back to this.",
"socially awkward but i do my best", "i smile a lot and my inquisitive nature",
"i have a big smile. i also get asked if i'm wearing blue-coloured contacts (no)."
), essay4 = c("books: absurdistan, the republic, of mice and men (only book that made me want to cry), catcher in the rye, the prince. movies: gladiator, operation valkyrie, the producers, down periscope. shows: the borgia, arrested development, game of
thrones, monty python music: aesop rock, hail mary mallon, george thorogood and the delaware destroyers, felt food: i'm down for anything.",
"i am die hard christopher moore fan. i don't really watch a lot of tv unless there is humor involved. i am kind of stuck on 90's alternative music. i am pretty much a fan of everything though... i do need to draw a line at most types of electronica.",
"okay this is where the cultural matrix gets so specific, it's like being in the crosshairs. for what it's worth, i find myself reading more non-fiction than fiction. it's usually some kind of philosophy, art, or science text by silly authors such as ranciere, de certeau, bataille, baudrillard, butler, stein, arendt, nietzche, zizek, etc. i'll often throw in some weird new age or pop-psychology book in the mix as well. as for fiction, i enjoy what little i've read of eco, perec, wallace, bolao, dick, vonnegut, atwood, delilo, etc. when i was young, i was a rabid asimov reader. directors i find myself drawn to are makavejev, kuchar, jodorowsky, herzog, hara, klein, waters, verhoeven, ackerman, hitchcock, lang, gorin, goddard, miike, ohbayashi, tarkovsky, sokurov, warhol, etc. but i also like a good amount of \"trashy\" stuff. too much to name. i definitely enjoy the character development that happens in long form episodic television over the course of 10-100 episodes, which a 1-2hr movie usually can't compete with. some of my recent tv favorites are: breaking bad, the wire, dexter, true blood, the prisoner, lost, fringe. a smattered sampling of
the vast field of music i like and deejay: art ensemble, sun ra, evan parker, lil wayne, dj funk, mr. fingers, maurizio, rob hood, dan bell, james blake, nonesuch recordings, omar souleyman, ethiopiques, fela kuti, john cage, meredith monk, robert ashley, terry riley, yoko ono, merzbow, tom tom club, jit, juke, bounce, hyphy, snap, crunk, b'more, kuduro, pop, noise, jazz, techno, house,
acid, new/no wave, (post)punk, etc. a few of the famous art/dance/theater folk that might locate my sensibility: andy warhol, bruce nauman, yayoi kusama, louise bourgeois, tino sehgal, george kuchar, michel duchamp, marina abramovic, gelatin, carolee schneeman, gustav metzger, mike kelly, mike smith, andrea fraser, gordon matta-clark, jerzy grotowski, samuel beckett, antonin artaud, tadeusz kantor, anna halperin, merce cunningham, etc. i'm clearly leaving out a younger generation of contemporary artists, many of whom are friends. local food regulars: sushi zone, chow, ppq, pagolac, lers ros, burma superstar, minako, shalimar, delfina pizza, rosamunde, arinells, suppenkuche, cha-ya, blue plate, golden era, etc.",
"bataille, celine, beckett. . . lynch, jarmusch, r.w. fassbender. . . twin peaks & fishing w/ john joy division, throbbing gristle, cabaret voltaire. . . vegetarian pho and coffee",
"music: bands, rappers, musicians at the moment: thee oh sees. forever: wu-tang books: artbooks for days audiobooks: my collection, thick (thanks audible) shows: live ones food: with stellar friends whenever movies > tv podcast: radiolab, this american life, the moth, joe rogan, the champs",
"books: to kill a mockingbird, lord of the rings, 1984, the farseer trilogy. music: the beatles, frank sinatra, john mayer, jason mraz, deadmau5, andrew bayer, everything on anjunadeep records, bach, satie. tv shows: how i met your mother, scrubs, the west wing, breaking bad. movies: star wars, the godfather pt ii, 500 days of summer, napoleon dynamite, american beauty, lotr food: thai, vietnamese, shanghai dumplings, pizza!"
), essay5 = c("food. water. cell phone. shelter.", "delicious porkness in all of its glories. my big ass doughboy's sinking into 15 new inches. my overly resilient liver. a good sharp knife. my ps3... it plays blurays too. ;) my over the top energy and my
outlook on life... just give me a bag of lemons and see what happens. ;)",
"movement conversation creation contemplation touch humor",
"", "", "like everyone else, i love my friends and family, and need hugs, human contact, water and sunshine. let's take that as given. 1. something to build 2. something to sing 3. something to play on (my guitar would be first choice) 4. something to write/draw on 5. a big goal worth dreaming about 6. something to laugh at"
), essay6 = c("duality and humorous things", "", "", "cats and german philosophy",
"", "what my contribution to the world is going to be and/or should be. and what's for breakfast. i love breakfast."
), essay7 = c("trying to find someone to hang out with. i am down for anything except a club.",
"", "viewing. listening. dancing. talking. drinking. performing.",
"", "", "out with my friends!"), essay8 = c("i am new to california and looking for someone to wisper my secrets to.",
"i am very open and will share just about anything.", "when i was five years old, i was known as \"the boogerman\".",
"", "", "i cried on my first day at school because a bird shat on my head. true story."
), essay9 = c("you want to be swept off your feet! you are tired of the norm. you want to catch a coffee or a bite. or if you
want to talk philosophy.",
"", "you are bright, open, intense, silly, ironic, critical, caring, generous, looking for an exploration, rather than finding \"a match\" of some predetermined qualities. i'm currently in a fabulous and open relationship, so you should be comfortable with that.",
"you feel so inclined.", "", "you're awesome.")), row.names = c(NA,
6L), class = "data.frame")
Use across() to apply the same function to multiple columns:
cupid %>%
mutate(across(starts_with("essay"), \(x) stringr::str_count(x, " +") + 1,
.names = "{.col}_num"))
# ...other column...
# essay0_num essay1_num essay2_num essay3_num essay4_num essay5_num essay6_num essay7_num
# 1 237 45 16 28 62 5 4 16
# 2 130 7 18 1 50 53 1 1
# 3 246 90 65 46 355 6 1 6
# 4 11 7 13 7 29 1 4 1
# 5 40 6 7 8 44 1 1 1
# 6 160 12 60 15 70 59 20 4
# essay8_num essay9_num
# 1 14 30
# 2 10 1
# 3 12 39
# 4 1 4
# 5 1 1
# 6 17 2
I simplified your word counting logic - splitting on spaces and looking at the length is the same as counting the spaces and adding 1. Using " +" as a regex pattern means consecutive spaces will be lumped together.
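If you prefer to keep your original strsplit()-based logic (which counts an empty string as 0 words rather than 1), the same across() pattern works; a sketch:
cupid %>%
  mutate(across(starts_with("essay"),
                \(x) lengths(strsplit(x, " ")),
                .names = "{.col}_num"))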

Apply Sentimentr on Dataframe with Multiple Sentences in 1 String Per Row

I have a dataset where I am trying to get the sentiment by article. I have about 1000 articles. Each article is a string. This string has multiple sentences within it. I ideally would like to add another column that would summarise the sentiment for each article. Is there an efficient way to do this using dplyr?
Below is an example dataset with just 2 articles.
date<- as.Date(c('2020-06-24', '2020-06-24'))
text <- c('3 more cops recover as PNP COVID-19 infections soar to 519', 'QC suspends processing of PWD IDs after reports of abuse in issuance of cards')
link<- c('https://newsinfo.inquirer.net/1296981/3-more-cops-recover-as-pnps-covid-19-infections-soar-to-519,3,10,4,11,9,8', 'https://newsinfo.inquirer.net/1296974/qc-suspends-processing-of-pwd-ids-after-reports-of-abuse-in-issuance-of-cards')
V4 <-c('MANILA, Philippines — Three more police officers have recovered from the new coronavirus disease, increasing the total number of recoveries in the Philippine National Police to (PNP) 316., This developed as the total number of COVID-19 cases in the PNP rose to 519 with one new infection and nine deaths recorded., In a Facebook post on Wednesday, the PNP also recorded 676 probable and 876 suspects for the disease., PNP chief Gen. Archie Gamboa previously said the force would will intensify its health protocols among its personnel after recording a recent increase in deaths., The latest fatality of the ailment is a police officer in Cebu City, which is under enhanced community quarantine as COVID-19 cases continued to surge there., ATM, \r\n\r\nFor more news about the novel coronavirus click here.\r\nWhat you need to know about Coronavirus.\r\n\r\n\r\n\r\nFor more information on COVID-19, call the DOH Hotline: (02) 86517800 local 1149/1150.\r\n\r\n \r\n \r\n \r\n\r\n \r\n , The Inquirer Foundation supports our healthcare frontliners and is still accepting cash donations to be deposited at Banco de Oro (BDO) current account #007960018860 or donate through PayMaya using this link .',
'MANILA, Philippines — Quezon City will halt the processing of identification cards to persons with disability for two days starting Thursday, June 25, so it could tweak its guidelines after reports that unqualified persons had issued with the said IDs., In a statement on Wednesday, Quezon City Mayor Joy Belmonte said the suspension would the individual who issued PWD ID cards to six members of a family who were not qualified but who paid P2,000 each to get the IDs., Belmonte said the suspect, who is a local government employee, was already issued with a show-cause order to respond to the allegation., According to city government lawyer Nino Casimir, the suspect could face a grave misconduct case that could result in dismissal., The IDs are issued to only to persons qualified under the Act Expanding the Benefits and Privileges of Persons with Disability (Republic Act No. 10754)., The IDs entitle PWDs to a 20 percent discount and VAT exemption on goods and services., /atm')
df<-data.frame(date, text, link, V4)
head(df)
So I have been looking up how to do this using the sentimentr package and created the code below. However, this only outputs each sentence's sentiment (I do this by doing a strsplit on "..,") and I want instead to aggregate everything at the full article level after applying this strsplit.
library(sentimentr)
full<-df %>%
group_by(V4) %>%
mutate(V2 = strsplit(as.character(V4), "[.],")) %>%
unnest(V2) %>%
get_sentences() %>%
sentiment()
The desired output I am looking for is simply an extra column in my df dataframe with a summary sum(sentiment) for each article.
Additional info based on answer below:
date<- as.Date(c('2020-06-24', '2020-06-24'))
text <- c('3 more cops recover as PNP COVID-19 infections soar to 519', 'QC suspends processing of PWD IDs after reports of abuse in issuance of cards')
link<- c('https://newsinfo.inquirer.net/1296981/3-more-cops-recover-as-pnps-covid-19-infections-soar-to-519,3,10,4,11,9,8', 'https://newsinfo.inquirer.net/1296974/qc-suspends-processing-of-pwd-ids-after-reports-of-abuse-in-issuance-of-cards')
V4 <-c('MANILA, Philippines — Three more police officers have recovered from the new coronavirus disease, increasing the total number of recoveries in the Philippine National Police to (PNP) 316., This developed as the total number of COVID-19 cases in the PNP rose to 519 with one new infection and nine deaths recorded., In a Facebook post on Wednesday, the PNP also recorded 676 probable and 876 suspects for the disease., PNP chief Gen. Archie Gamboa previously said the force would will intensify its health protocols among its personnel after recording a recent increase in deaths., The latest fatality of the ailment is a police officer in Cebu City, which is under enhanced community quarantine as COVID-19 cases continued to surge there., ATM, \r\n\r\nFor more news about the novel coronavirus click here.\r\nWhat you need to know about Coronavirus.\r\n\r\n\r\n\r\nFor more information on COVID-19, call the DOH Hotline: (02) 86517800 local 1149/1150.\r\n\r\n \r\n \r\n \r\n\r\n \r\n , The Inquirer Foundation supports our healthcare frontliners and is still accepting cash donations to be deposited at Banco de Oro (BDO) current account #007960018860 or donate through PayMaya using this link .',
'MANILA, Philippines — Quezon City will halt the processing of identification cards to persons with disability for two days starting Thursday, June 25, so it could tweak its guidelines after reports that unqualified persons had issued with the said IDs., In a statement on Wednesday, Quezon City Mayor Joy Belmonte said the suspension would the individual who issued PWD ID cards to six members of a family who were not qualified but who paid P2,000 each to get the IDs., Belmonte said the suspect, who is a local government employee, was already issued with a show-cause order to respond to the allegation., According to city government lawyer Nino Casimir, the suspect could face a grave misconduct case that could result in dismissal., The IDs are issued to only to persons qualified under the Act Expanding the Benefits and Privileges of Persons with Disability (Republic Act No. 10754)., The IDs entitle PWDs to a 20 percent discount and VAT exemption on goods and services., /atm')
df<-data.frame(date, text, link, V4)
df %>%
group_by(V4) %>% # group by not really needed
mutate(V4 = gsub("[.],", ".", V4),
sentiment_score = sentiment_by(V4))
# A tibble: 2 x 5
# Groups: V4 [2]
date text link V4 sentiment_score$e~ $word_count $sd $ave_sentiment
<date> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 2020-06-24 3 more cops recover as P~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Three more police officers ~ 1 172 0.204 -0.00849
2 2020-06-24 QC suspends processing o~ https://newsinfo.inquirer.net/1296~ "MANILA, Philippines — Quezon City will halt the p~ 1 161 0.329 -0.174
Warning message:
Can't combine <sentiment_by> and <sentiment_by>; falling back to <data.frame>.
x Some attributes are incompatible.
i The author of the class should implement vctrs methods.
i See <https://vctrs.r-lib.org/reference/faq-error-incompatible-attributes.html>.
If you need the sentiment over the whole text, there is no need to split the text into sentences first; the sentiment functions take care of this. I replaced the "., " in your text with periods, as those are needed by the sentiment functions. The sentiment functions recognize abbreviations like "mr." as not being the end of a sentence. If you use get_sentences() first, you get the sentiment per sentence and not over the whole text.
The function sentiment_by handles the sentiment over the whole text and averages it nicely. Check the help page for the averaging.function option if you need to change this. The by argument of the function can handle any grouping you want to apply.
df %>%
group_by(V4) %>% # group by not really needed
mutate(V4 = gsub("[.],", ".", V4),
sentiment_score = sentiment_by(V4))
# A tibble: 2 x 5
# Groups: V4 [2]
date text link V4 sentiment_score$~ $word_count $sd $ave_sentiment
<date> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 2020-06-24 3 more cops recov~ https://newsinfo.inquire~ "MANILA, Philippines — Three~ 1 172 0.204 -0.00849
2 2020-06-24 QC suspends proce~ https://newsinfo.inquire~ "MANILA, Philippines — Quezo~ 1 161 0.329 -0.174
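If you just want a single numeric column rather than the nested sentiment_by data frame, one option is to extract ave_sentiment directly (a sketch; the column name article_sentiment is only an example):
library(dplyr)
library(sentimentr)
df %>%
  mutate(V4 = gsub("[.],", ".", V4),
         article_sentiment = sentiment_by(V4)$ave_sentiment)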

Naive Bayes model NOT predicting anything when applying the model - predict function returning a factor with 0 levels

My dataset looks like the following, and I followed the Classification using Naive Bayes tutorial to develop my Naive Bayes model for text mining. However, I cannot predict the result of my Naive Bayes model, even though the model is built. The predict function returns a factor with 0 levels. Below are my dataset and code so far.
Dataset:
lie sentiment review
f n 'Mike\'s Pizza High Point NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.'
f n 'i really like this buffet restaurant in Marshall street. they have a lot of selection of american japanese and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.'
f n 'After I went shopping with some of my friend we went to DODO restaurant for dinner. I found worm in one of the dishes .'
f n 'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
f n 'The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well never more. '
f n 'I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\'t acknowledge the coupon. When I asked her about it she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is otherwise my favorite place to dine. '
f n 'I went to ABC restaurant two days ago and I hated the food and the service. We were kept waiting for over an hour just to get seated and once we ordered our food came out cold. I ordered the pasta and it was terrible - completely bland and very unappatizing. I definitely would not recommend going there especially if you\'re in a hurry!'
f n 'I went to the Chilis on Erie Blvd and had the worst meal of my life. We arrived and waited 5 minutes for a hostess and then were seated by a waiter who was obviously in a terrible mood. We order drinks and it took them 15 minutes to bring us both the wrong beers which were barely cold. Then we order an appetizer and wait 25 minutes for cold southwest egg rolls at which point we just paid and left. Don\'t go.'
f n 'OMG. This restaurant is horrible. The receptionist did not greet us we just stood there and waited for five minutes. The food came late and served not warm. Me and my pet ordered a bowl of salad and a cheese pizza. The salad was not fresh the crust of a pizza was so hard like plastics. My dog didn\'t even eat that pizza. I hate this place!!!!!!!!!!'
dput(head(lie))
structure(list(lie = c("f", "f", "f", "f", "f", "f"), sentiment = c("n",
"n", "n", "n", "n", "n"), review = c("Mike\\'s Pizza High Point, NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza, not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.",
"i really like this buffet restaurant in Marshall street. they have a lot of selection of american, japanese, and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.",
"After I went shopping with some of my friend, we went to DODO restaurant for dinner. I found worm in one of the dishes .",
"Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\\'t go to the Olive Oil Garden. ",
"The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well, never more. ",
"I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\\'t acknowledge the coupon. When I asked her about it, she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall, I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is, otherwise, my favorite place to dine. "
)), .Names = c("lie", "sentiment", "review"), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000180788>)
R code:
library(data.table) # for fread()
library(tm)         # for Corpus(), tm_map(), DocumentTermMatrix(), findFreqTerms()
library(e1071)      # for naiveBayes()
library(gmodels)
lie<- fread('deception.csv',header = T,fill = T,quote = "\'")
str(lie)
lie
#Corpus Building
words.vec<- VectorSource(lie$review)
words.corpus<- Corpus(words.vec)
words.corpus<-tm_map(words.corpus,content_transformer(tolower)) #lower case
words.corpus<-tm_map(words.corpus,removePunctuation) # remove punctuation
words.corpus<-tm_map(words.corpus,removeNumbers) # remove numbers
words.corpus<-tm_map(words.corpus,removeWords,stopwords('english')) # remove stopwords
words.corpus<-tm_map(words.corpus,stripWhitespace) # remove unnecessary whitespace
#==========================================================================
#Document term Matrix
dtm<-DocumentTermMatrix(words.corpus)
dtm
class(dtm)
#dtm_df<-as.data.frame(as.matrix(dtm))
#class(dtm_df)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq,decreasing=TRUE)
freq[head(ord)]
freq[tail(ord)]
#===========================================================================
#Data frame partition
#Splitting DTM
dtm_train <- dtm[1:61, ]
dtm_test <- dtm[62:92, ]
train_labels <- lie[1:61, ]$lie
test_labels <-lie[62:92, ]$lie
str(train_labels)
str(test_labels)
prop.table(table(train_labels))
prop.table(table(test_labels))
freq_words <- findFreqTerms(dtm_train, 10)
freq_words
dtm_freq_train<- dtm_train[ , freq_words]
dtm_freq_test <- dtm_test[ , freq_words]
dtm_freq_test
convert_counts <- function(x) {
x <- ifelse(x > 0, 'yes','No')
}
train <- apply(dtm_freq_train, MARGIN = 2, convert_counts)
test <- apply(dtm_freq_test, MARGIN = 2, convert_counts)
str(test)
nb_classifier<-naiveBayes(train,train_labels)
nb_classifier
test_pred<-predict(nb_classifier,test)
Thanks in advance for the help.
Naive Bayes requires the response variable to be a categorical (factor) class variable.
Convert the lie column of your lie data frame to a factor and re-run the analysis:
lie$lie <- as.factor(lie$lie)
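For example, a minimal sketch based on the code above; doing the conversion before splitting ensures both label vectors carry the factor levels:
lie$lie <- as.factor(lie$lie)

train_labels <- lie$lie[1:61]
test_labels  <- lie$lie[62:92]

nb_classifier <- naiveBayes(train, train_labels)
test_pred     <- predict(nb_classifier, test)

# cross-tabulate predictions against the true labels
table(test_pred, test_labels)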

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with 'ngrams' using the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
output = bigrams,
input = text,
token = "ngrams",
n = 2)
But when I run this I get the following error message:
error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <- c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimentr or exploratory package caused the conflict. I reloaded my packages without these and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- data_frame(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
output = bigrams,
input = text,
token = "ngrams",
n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?
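One quick way to check (a sketch, assuming your tibble is called president_tweets):
# list-columns will show up as "list" here
sapply(president_tweets, class)
# or inspect the full structure
str(president_tweets)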
