R: How to Prepare Data for LDA/Text Analysis - r

I am working with the R programming language.
I would like to fit a BTM (Biterm Topic Model, a variant of LDA (Latent Dirichlet Allocation) designed for short texts) on some text data. I am following this tutorial: https://cran.r-project.org/web/packages/BTM/readme/README.html
When I look at the dataset ("brussels_reviews_anno") used in this tutorial, it looks something like this (I cannot recognize the format of this data!):
library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")
head(brussels_reviews_anno)
doc_id language sentence_id token_id token lemma upos xpos
1 32198807 es 1 1 Gwen gwen NOUN NNP
2 32198807 es 1 2 fue ser VERB VB
3 32198807 es 1 3 una un DET DT
4 32198807 es 1 4 magnifica magnifica NOUN NN
5 32198807 es 1 5 anfitriona anfitriono ADJ JJ
6 32198807 es 1 6 . . PUNCT .
My dataset ("my_data") is currently in the format below. I manually created a text dataset for this example using reviews of fast food restaurants found on the internet:
my_data = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
Can someone please show me how I can take my dataset and transform it in such a way that I can perform BTM analysis on this data and create a visualization similar to the visualizations in this tutorial?
Thanks!
Additional References:
https://rforanalytics.com/11-7-topic-modelling.html

The class of brussels_reviews_anno is just a regular data.frame. That structure is generated by the function udpipe() from the package udpipe.
Below I provide a working example that shows how to replicate a similar data structure. The only thing you need to change is the path where the language model is saved.
Please keep in mind that udpipe() does a lot of stuff. The final data.frame out has many more columns than brussels_reviews_anno because I neither tweaked any parameters of the function nor deleted any of the columns.
Overall, to get started with BTM() you need to tokenize your textual data. That's one of the things you can do with the package udpipe.
Hope this helped!
library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")
head(brussels_reviews_anno)
#> doc_id language sentence_id token_id token lemma upos xpos
#> 1 32198807 es 1 1 Gwen gwen NOUN NNP
#> 2 32198807 es 1 2 fue ser VERB VB
#> 3 32198807 es 1 3 una un DET DT
#> 4 32198807 es 1 4 magnifica magnifica NOUN NN
#> 5 32198807 es 1 5 anfitriona anfitriono ADJ JJ
#> 6 32198807 es 1 6 . . PUNCT .
my_data = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
# download a language model
udpipe_download_model("english-ewt", model_dir = "~/Desktop/")
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to ~/Desktop//english-ewt-ud-2.5-191206.udpipe
#> - This model has been trained on version 2.5 of data from https://universaldependencies.org
#> - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#> - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#> - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '~/Desktop//english-ewt-ud-2.5-191206.udpipe'
#> language file_model
#> 1 english-ewt ~/Desktop//english-ewt-ud-2.5-191206.udpipe
#> url
#> 1 https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe
#> download_failed download_message
#> 1 FALSE OK
# load in the environment
eng_model = udpipe_load_model("~/Desktop/english-ewt-ud-2.5-191206.udpipe")
# apply the tokenization
out = udpipe(my_data$reviews, object = eng_model)
head(out)
#> doc_id paragraph_id sentence_id
#> 1 doc1 1 1
#> 2 doc1 1 1
#> 3 doc1 1 1
#> 4 doc1 1 1
#> 5 doc1 1 1
#> 6 doc1 1 1
#> sentence
#> 1 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 2 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 3 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 4 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 5 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> 6 I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave.
#> start end term_id token_id token lemma upos xpos
#> 1 1 1 1 1 I I PRON PRP
#> 2 3 7 2 2 guess guess VERB VBP
#> 3 9 11 3 3 the the DET DT
#> 4 13 20 4 4 employee employee NOUN NN
#> 5 22 28 5 5 decided decide VERB VBD
#> 6 30 31 6 6 to to PART TO
#> feats head_token_id dep_rel deps misc
#> 1 Case=Nom|Number=Sing|Person=1|PronType=Prs 2 nsubj <NA> <NA>
#> 2 Mood=Ind|Tense=Pres|VerbForm=Fin 0 root <NA> <NA>
#> 3 Definite=Def|PronType=Art 4 det <NA> <NA>
#> 4 Number=Sing 5 nsubj <NA> <NA>
#> 5 Mood=Ind|Tense=Past|VerbForm=Fin 2 ccomp <NA> <NA>
#> 6 <NA> 7 mark <NA> <NA>
Created on 2022-09-20 by the reprex package (v2.0.1)
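To go from the tokenized output to something BTM() can consume, the data usually has to be reduced to one row per token, with a context identifier (here the document id) in the first column and the token or lemma in the second. The sketch below illustrates that step on a small hand-made data frame that mimics the doc_id, lemma and upos columns of out. The toy rows, the choice to keep only nouns, adjectives and verbs, and the values k = 2 and iter = 100 are assumptions made for illustration and are not part of the answer above. The BTM() call only runs if the BTM package is installed.

```r
# Toy stand-in for the `out` data frame returned by udpipe() (hypothetical rows)
toy_out <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2", "doc2", "doc2"),
  lemma  = c("the", "employee", "buy", "lunch", "crispy", "chicken", "be", "soggy"),
  upos   = c("DET", "NOUN", "VERB", "NOUN", "ADJ", "NOUN", "AUX", "ADJ"),
  stringsAsFactors = FALSE
)

# Keep content words only (assumed filter; adjust to your needs)
keep <- toy_out$upos %in% c("NOUN", "ADJ", "VERB")

# BTM() expects two columns: a context identifier, then the token
btm_input <- toy_out[keep, c("doc_id", "lemma")]
print(btm_input)

# Fit the model only when the BTM package is available
if (requireNamespace("BTM", quietly = TRUE)) {
  set.seed(123)
  model <- BTM::BTM(btm_input, k = 2, iter = 100)
  print(terms(model, top_n = 3))
}
```

On the real data the same two steps would be applied to out instead of toy_out, with a larger k. The BTM README then plots the fitted model with the textplot package, which should give a visualization similar to the one in the tutorial.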

Related

Convert results into a dataframe from function

From these results:
library(stm)
labelTopics(gadarianFit, n = 15)
Topic 1 Top Words:
Highest Prob: immigr, illeg, legal, border, will, need, worri, work, countri, mexico, life, better, nation, make, worker
FREX: border, mexico, mexican, need, concern, fine, make, better, worri, nation, deport, worker, will, econom, poor
Lift: cross, racism, happen, other, continu, concern, deport, mexican, build, fine, econom, border, often, societi, amount
Score: immigr, border, need, will, mexico, illeg, mexican, worri, concern, legal, nation, fine, worker, better, also
Topic 2 Top Words:
Highest Prob: job, illeg, tax, pay, american, take, care, welfar, crime, system, secur, social, health, cost, servic
FREX: cost, health, servic, welfar, increas, loss, school, healthcar, job, care, medic, crime, social, violenc, educ
Lift: violenc, expens, opportun, cost, healthcar, loss, increas, gang, servic, medic, health, diseas, terror, school, lose
Score: job, welfar, crime, cost, tax, care, servic, increas, health, pay, school, loss, medic, healthcar, social
Topic 3 Top Words:
Highest Prob: peopl, come, countri, think, get, english, mani, live, citizen, learn, way, becom, speak, work, money
FREX: english, get, come, mani, back, becom, like, think, new, send, right, way, just, live, peopl
Lift: anyth, send, still, just, receiv, deserv, back, new, english, mani, get, busi, year, equal, come
Score: think, peopl, come, get, english, countri, mani, speak, way, send, back, money, becom, learn, live
How is it possible to store the highest-probability words in a dataframe with the number of columns equal to the number of topics and the number of rows equal to the number of words per topic (n = 15)?
Example of expected output:
topic1 topic2 topic3
immigr job peopl
illeg illeg come
In the labelTopics object, words are stored under prob. So you could try something like this:
library(stm)
topics <- labelTopics(gadarianFit, n=15)
topics <- data.frame(t(topics$prob))
colnames(topics) <- paste0("topic", 1:ncol(topics))
topics
#> topic1 topic2 topic3
#> 1 immigr job peopl
#> 2 illeg illeg come
#> 3 legal tax countri
#> 4 border pay think
#> 5 will american get
#> 6 need take english
#> 7 worri care mani
#> 8 work welfar live
#> 9 countri crime citizen
#> 10 mexico system learn
#> 11 life secur way
#> 12 better social becom
#> 13 nation health speak
#> 14 make cost work
#> 15 worker servic money
Note that stm offers several other ways of selecting the most important words per topic, including FREX, Lift and Score, which are stored under frex, lift and score. You would simply change prob in my code to use one of those.
Type this to see them:
topics <- labelTopics(gadarianFit, n=15)
str(topics)
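To see why the transpose is needed, here is a small sketch with a hand-made list standing in for the labelTopics() result. The object mock_topics, its words and the helper name topic_words are hypothetical; only the element names prob and frex come from stm. Each matrix has one row per topic, so t() is what turns topics into columns.

```r
# Hypothetical stand-in for labelTopics() output: one row per topic
mock_topics <- list(
  prob = matrix(c("immigr", "illeg", "legal",
                  "job",    "illeg", "tax",
                  "peopl",  "come",  "countri"),
                nrow = 3, byrow = TRUE),
  frex = matrix(c("border",  "mexico", "mexican",
                  "cost",    "health", "servic",
                  "english", "get",    "come"),
                nrow = 3, byrow = TRUE)
)

# Illustrative helper: turn one of the word matrices into a data frame
# with one column per topic and one row per word rank
topic_words <- function(labels, type = "prob") {
  df <- data.frame(t(labels[[type]]), stringsAsFactors = FALSE)
  colnames(df) <- paste0("topic", seq_len(ncol(df)))
  df
}

prob_df <- topic_words(mock_topics, "prob")
frex_df <- topic_words(mock_topics, "frex")
print(prob_df)
```

With a fitted model, the same helper could be applied to labelTopics(gadarianFit, n = 15) instead of mock_topics.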

Naive Bayes model NOT predicting anything on applying model- Predict function returning with 0 factor level

My dataset looks like the following. I followed the "Classification using Naive Bayes" tutorial to develop my Naive Bayes model for text mining. However, I cannot get predictions from it even though the model is built: the predict function returns a factor with 0 levels. Below is my dataset and code so far.
Dataset:
lie sentiment review
f n 'Mike\'s Pizza High Point NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.'
f n 'i really like this buffet restaurant in Marshall street. they have a lot of selection of american japanese and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.'
f n 'After I went shopping with some of my friend we went to DODO restaurant for dinner. I found worm in one of the dishes .'
f n 'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
f n 'The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well never more. '
f n 'I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\'t acknowledge the coupon. When I asked her about it she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is otherwise my favorite place to dine. '
f n 'I went to ABC restaurant two days ago and I hated the food and the service. We were kept waiting for over an hour just to get seated and once we ordered our food came out cold. I ordered the pasta and it was terrible - completely bland and very unappatizing. I definitely would not recommend going there especially if you\'re in a hurry!'
f n 'I went to the Chilis on Erie Blvd and had the worst meal of my life. We arrived and waited 5 minutes for a hostess and then were seated by a waiter who was obviously in a terrible mood. We order drinks and it took them 15 minutes to bring us both the wrong beers which were barely cold. Then we order an appetizer and wait 25 minutes for cold southwest egg rolls at which point we just paid and left. Don\'t go.'
f n 'OMG. This restaurant is horrible. The receptionist did not greet us we just stood there and waited for five minutes. The food came late and served not warm. Me and my pet ordered a bowl of salad and a cheese pizza. The salad was not fresh the crust of a pizza was so hard like plastics. My dog didn\'t even eat that pizza. I hate this place!!!!!!!!!!'
dput(df)
> dput(head(lie))
structure(list(lie = c("f", "f", "f", "f", "f", "f"), sentiment = c("n",
"n", "n", "n", "n", "n"), review = c("Mike\\'s Pizza High Point, NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza, not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.",
"i really like this buffet restaurant in Marshall street. they have a lot of selection of american, japanese, and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.",
"After I went shopping with some of my friend, we went to DODO restaurant for dinner. I found worm in one of the dishes .",
"Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\\'t go to the Olive Oil Garden. ",
"The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well, never more. ",
"I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\\'t acknowledge the coupon. When I asked her about it, she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall, I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is, otherwise, my favorite place to dine. "
)), .Names = c("lie", "sentiment", "review"), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000180788>)
R code:
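library(gmodels)
library(data.table) # fread
library(tm)         # VectorSource, Corpus, tm_map, DocumentTermMatrix
library(e1071)      # naiveBayes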
lie<- fread('deception.csv',header = T,fill = T,quote = "\'")
str(lie)
lie
#Corpus Building
words.vec<- VectorSource(lie$review)
words.corpus<- Corpus(words.vec)
words.corpus<-tm_map(words.corpus,content_transformer(tolower)) #lower case
words.corpus<-tm_map(words.corpus,removePunctuation) # remove punctuation
words.corpus<-tm_map(words.corpus,removeNumbers) # remove numbers
words.corpus<-tm_map(words.corpus,removeWords,stopwords('english')) # remove stopwords
words.corpus<-tm_map(words.corpus,stripWhitespace) # remove unnecessary whitespace
#==========================================================================
#Document term Matrix
dtm<-DocumentTermMatrix(words.corpus)
dtm
class(dtm)
#dtm_df<-as.data.frame(as.matrix(dtm))
#class(dtm_df)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq,decreasing=TRUE)
freq[head(ord)]
freq[tail(ord)]
#===========================================================================
#Data frame partition
#Splitting DTM
dtm_train <- dtm[1:61, ]
dtm_test <- dtm[62:92, ]
train_labels <- lie[1:61, ]$lie
test_labels <-lie[62:92, ]$lie
str(train_labels)
str(test_labels)
prop.table(table(train_labels))
prop.table(table(test_labels))
freq_words <- findFreqTerms(dtm_train, 10)
freq_words
dtm_freq_train<- dtm_train[ , freq_words]
dtm_freq_test <- dtm_test[ , freq_words]
dtm_freq_test
convert_counts <- function(x) {
x <- ifelse(x > 0, 'yes','No')
}
train <- apply(dtm_freq_train, MARGIN = 2, convert_counts)
test <- apply(dtm_freq_test, MARGIN = 2, convert_counts)
str(test)
nb_classifier<-naiveBayes(train,train_labels)
nb_classifier
test_pred<-predict(nb_classifier,test)
Thanks in advance for help,
Naive Bayes requires the response variable to be a categorical (factor) variable.
Convert the lie column of your lie data frame to a factor before extracting train_labels and test_labels, then rerun the analysis:
lie$lie <- as.factor(lie$lie)
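The difference between a character vector and a factor can be checked in base R. The sketch below uses a short made-up label vector in place of lie$lie. That the missing levels are what leaves predict() with nothing to return is an assumption about the cause, not something stated in the answer.

```r
# Hypothetical stand-in for lie$lie as read by fread(): plain character labels
labels_chr <- c("f", "f", "t", "t", "f", "t")

# A character vector carries no class levels
print(levels(labels_chr))

# After conversion the classes are explicit
labels_fct <- as.factor(labels_chr)
print(levels(labels_fct))

# Convert first, then split, so both subsets keep the full set of levels
train_labels <- labels_fct[1:4]
test_labels  <- labels_fct[5:6]
print(levels(test_labels))
```

In the question's script this means running the as.factor() line right after fread(), before train_labels and test_labels are created.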

How to retrieve movies' genres from wikidata using R

I would like to retrieve information from wikidata and store it in a dataframe. For the sake of simplicity I am going to assume that I want to get the genre of the following movies and then filter those that belong to sci-fi:
movies = c("Star Wars Episode IV: A New Hope", "Interstellar",
"Happythankyoumoreplease")
I know there is a package called WikidataR. If I am not wrong, and according to its vignettes, there are two commands that may be useful: find_item and find_property allow you to retrieve a set of Wikidata items or properties where the aliases or descriptions match a particular search term. They seemed ideal for my purpose, so I thought of doing something like
for (i in movies) {
info = find_item(i)
}
This is what I get from each item:
> find_item("Interstellar")
Wikidata item search
Number of results: 10
Results:
1 Interstellar (Q13417189) - 2014 US science fiction film
2 Interstellar (Q6057099)
3 interstellar medium (Q41872) - matter and fields (radiation) that exist in the space between the star systems in a galaxy;includes gas in ionic, atomic or molecular form, dust and cosmic rays. It fills interstellar space and blends smoothly into the surrounding intergalactic space
4 space colonization (Q686876) - concept of permanent human habitation outside of Earth
5 rogue planet (Q167910) - planetary-mass object that orbits the galaxy directly
6 interstellar cloud (Q1054444) - accumulation of gas, plasma and dust in a galaxy
7 interstellar travel (Q834826) - term used for hypothetical manned or unmanned travel between stars
8 Interstellar Boundary Explorer (Q835898)
9 starship (Q2003852) - spacecraft designed for interstellar travel
10 interstellar object (Q2441216) - astronomical object in interstellar space, such as a comet
>
Unfortunately, the information that I get from find_item (see above) has two problems:
it is not a dataframe with all wikidata information of the item I am searching but a list of what seems to be metadata (wikidata's id, link...).
it does not have the information I need (wikidata's properties from each particular wikidata item).
Similarly, find_property provides metadata of a certain property. find_property("genre") retrieves the following information:
> find_property("genre")
Wikidata property search
Number of results: 4
Results:
1 genre (P136) - a creative work's genre or an artist's field of work (P101). Use main subject (P921) to relate creative works to their topic
2 radio format (P415) - describes the overall content broadcast on a radio station
3 sex or gender (P21) - sexual identity of subject: male (Q6581097), female (Q6581072), intersex (Q1097630), transgender female (Q1052281), transgender male (Q2449503). Animals: male animal (Q44148), female animal (Q43445). Groups of same gender use "subclass of" (P279)
4 gender of a scientific name of a genus (P2433) - determines the correct form of some names of species and subdivisions of species, also subdivisions of a genus
This has similar problems:
it is not a dataframe
it just stores metadata about the property
I don't find any way to link each property with each object in movies vector.
Is there any way to end up with a dataframe containing the genres of those movies? (or a dataframe with all wikidata's information which I will have to manipulate in order to filter or select my desired data?)
These are just lists. You can get a picture of their structure with str(find_item("Interstellar")), for example.
Then you can go through each element of the list and pick the items that you need, for example the title and the label:
a <- find_item("Interstellar")
b <- Reduce(rbind,lapply(a, function(x) cbind(x$title,x$label)))
data.frame(b)
## X1 X2
## 1 Q13417189 Interstellar
## 2 Q6057099 Interstellar
## 3 Q41872 interstellar medium
## 4 Q686876 space colonization
## 5 Q167910 rogue planet
## 6 Q1054444 interstellar cloud
## 7 Q834826 interstellar travel
## 8 Q835898 Interstellar Boundary Explorer
## 9 Q2003852 starship
## 10 Q2441216 interstellar object
This works easily for regular data. If some element is missing, you will have to handle it; for example, some items don't have a description. You can get around that with the following:
Reduce("rbind",lapply(a,
function(x) cbind(x$title,
x$label,
ifelse(length(x$description)==0,NA,x$description))))
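The same idea can be written so that the result is a data frame with named columns. The sketch below runs on a hand-made list shaped like the search results above; mock_items, its shortened descriptions and the helper field_or_na are hypothetical names used only for illustration, and no call to Wikidata is made.

```r
# Hypothetical search results shaped like the list returned by find_item()
mock_items <- list(
  list(title = "Q13417189", label = "Interstellar",
       description = "2014 US science fiction film"),
  list(title = "Q6057099", label = "Interstellar"),  # no description
  list(title = "Q41872", label = "interstellar medium",
       description = "matter and fields that exist between star systems")
)

# Return NA when a field is absent so every row has the same columns
field_or_na <- function(x, name) {
  value <- x[[name]]
  if (is.null(value) || length(value) == 0) NA_character_ else as.character(value)
}

# One single-row data frame per item, stacked into one result
items_df <- do.call(rbind, lapply(mock_items, function(x) {
  data.frame(id          = field_or_na(x, "title"),
             label       = field_or_na(x, "label"),
             description = field_or_na(x, "description"),
             stringsAsFactors = FALSE)
}))
print(items_df)
```

Using data.frame() per item instead of cbind() keeps the column names and makes the missing description show up as NA rather than shifting columns.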

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with 'ngrams', using the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
output = bigrams,
input = text,
token = "ngrams",
n = 2)
But when I run this I get the following error message:
error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <- c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimentr or exploratory package caused the conflict. I reloaded my packages without these and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- tibble(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
output = bigrams,
input = text,
token = "ngrams",
n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?
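One way to answer those questions is to inspect the column types directly. The sketch below uses a small made-up data frame (tweets and its hashtags column are hypothetical) to show how a list-column can be detected and, if it is not needed, dropped before calling unnest_tokens().

```r
# Toy data frame with one ordinary column and one list-column (hypothetical data)
tweets <- data.frame(text = c("first tweet", "second tweet"),
                     stringsAsFactors = FALSE)
tweets$hashtags <- list(c("#a", "#b"), character(0))

# Class of every column
col_classes <- vapply(tweets, function(col) class(col)[1], character(1))
print(col_classes)

# Names of the columns that are lists rather than atomic vectors
list_cols <- names(tweets)[vapply(tweets, is.list, logical(1))]
print(list_cols)

# Keep only the atomic columns
atomic_only <- tweets[, !(names(tweets) %in% list_cols), drop = FALSE]
print(names(atomic_only))
```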

Searching one .CSV File with the Columns of another .CSV File

I'm working with two Excel files. One is a very large (~5 to 6 GB) data set. This is from the government's Open Payments Data, which is free and open to everyone to view. It is a file that lists all of the disclosed payments from Industry to physicians, and it is open to the public (which is why I took a screenshot).
The second Excel file I'm working with is also large, but it's a file that lists the disclosed payments from physicians at a particular institution.
My goals:
I'd like to filter the Open Payments Data down to just the physicians I have in my second Excel file. Is there any way to do that? The Open Payments Data is inconsistent and mixes uppercase and lowercase.
What I've done so far:
I've been able to parse out the Open Payments Data to just include the state of the physicians I'm looking for. I've also imported both of these .csv files into R and named them accordingly.
I'm taking a course in R right now but it's been no help ... and most of the answers I've found online are for smaller sets of data. The data I'm working with has ~500,000 rows! Thank you in advance for your insight.
Edit: This is head(mydata)
Physician_Profile_ID Physician_First_Name
1 377519 KELLI
2 377519 KELLI
3 377519 KELLI
4 272641 ABDUL
5 272641 ABDUL
6 272641 ABDUL
Physician_Middle_Name Physician_Last_Name
1 A AABY
2 A AABY
3 A AABY
4 A AADAM
5 A AADAM
6 AADAM
Physician_Name_Suffix
1
2
3
4
5
6
Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name
1 BioHorizons Implant Systems Inc.
2 BioHorizons Implant Systems Inc.
3 BioHorizons Implant Systems Inc.
4 APOLLO ENDOSURGERY INC
5 APOLLO ENDOSURGERY INC
6 BOSTON SCIENTIFIC CORPORATION
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name
1 BioHorizons Implant Systems Inc.
2 BioHorizons Implant Systems Inc.
3 BioHorizons Implant Systems Inc.
4 Apollo Endosurgery Inc
5 APOLLO ENDOSURGERY INC
6 Boston Scientific Corporation
Total_Amount_of_Payment_USDollars Date_of_Payment
1 11.55 6/17/2014
2 187.50 6/4/2014
3 222.24 5/23/2014
4 60.20 5/4/2014
5 110.15 7/28/2014
6 12.36 12/10/2014
Form_of_Payment_or_Transfer_of_Value
1 In-kind items and services
2 In-kind items and services
3 In-kind items and services
4 In-kind items and services
5 In-kind items and services
6 In-kind items and services
Nature_of_Payment_or_Transfer_of_Value City_of_Travel
1 Food and Beverage
2 Gift
3 Education
4 Food and Beverage
5 Food and Beverage
6 Food and Beverage
State_of_Travel Country_of_Travel
1
2
3
4
5
6
And this is head(institution_data, 2):
DB.ID Last.Name First.Name
1 12345 Johnson John
2 12354 Twain Mark
Names have been changed for confidentiality. DB ID != Physician_ID unfortunately.
A list (vector actually) of physician IDs could be constructed:
PHY_ID <- unique(
institution_data$DB.ID[ institution_data$DB.ID %in% mydata$Physician_Profile_ID ] )
Then extract the data from the main file using the matches to that vector:
chargedata <- mydata[ mydata$Physician_Profile_ID %in% PHY_ID , ]
Could also use match with the same logic, but the %in% function uses match "under the hood" and code written with %in% is generally easier to read. If the IDs were not supposed to match, which you should have stated if that were the case, then name matching could be attempted, but it would make sense to add additional criteria, such as state or nearby zipcode.
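Since the question notes that the two ID columns do not correspond and that capitalization is inconsistent, name matching might look like the sketch below. The rows are made up, name_key is a hypothetical helper, and matching on last and first name alone is an assumption that can produce false positives, which is why extra criteria such as state are worth adding to the key.

```r
# Toy stand-ins for the two files (hypothetical rows, column names from the question)
mydata <- data.frame(
  Physician_Profile_ID = c(377519, 272641, 111111),
  Physician_First_Name = c("KELLI", "Abdul", "john"),
  Physician_Last_Name  = c("AABY", "AADAM", "Johnson"),
  stringsAsFactors = FALSE
)
institution_data <- data.frame(
  DB.ID      = c(12345, 12354),
  Last.Name  = c("Johnson", "Twain"),
  First.Name = c("John", "Mark"),
  stringsAsFactors = FALSE
)

# Build a case-insensitive "LAST|FIRST" key, ignoring stray whitespace
name_key <- function(last, first) {
  paste(toupper(trimws(last)), toupper(trimws(first)), sep = "|")
}

payments_key    <- name_key(mydata$Physician_Last_Name, mydata$Physician_First_Name)
institution_key <- name_key(institution_data$Last.Name, institution_data$First.Name)

# Keep only the payment rows whose name key appears in the institution file
matched <- mydata[payments_key %in% institution_key, ]
print(matched)
```

The filter is the same %in% pattern as above, applied to a normalized name key instead of the ID column.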

Resources