Simulate Flipping French Fries in R

I have had this question since I was a kid:
Suppose you place 100 french fries onto a pan over a stove.
For this problem, let's assume that each french fry can only have 2 "states": "face up" or "face down".
Each french fry needs to be cooked for 1 minute on each side - if a french fry is cooked for more than 1 minute on any side, it is considered burnt.
You place the french fries on the pan, and after one minute you shake the pan - some of the french fries get flipped in the air and land on the pan either "face up" or "face down", but some of the french fries never get flipped at all.
After another minute has passed, you shake the pan again.
For the sake of this question, let's assume that each time you shake the pan, each individual french fry has a 50% chance of getting flipped in the air, and the french fries that were flipped in the air have a 50% chance of landing "face down" or "face up".
Here is the question:
After 2 minutes, how many of the 100 french fries are perfectly cooked and how many french fries are burnt?
How many minutes need to pass until all french fries are guaranteed to have been cooked on both sides (even though many of them will be burnt)?
Using R, I tried to write a simulation for this scenario:
original_data = data.frame(id = 1:100, state = "start")
number_fries_selected_in_first_flip = sample(1:100, 1, replace=F)
fries_selected_in_first_flip = sample(1:100, number_fries_selected_in_first_flip, replace=F)
This is where I got stuck - if I could somehow "tag" the french fries that were selected, I could assign a "burnt/perfectly cooked" status to these french fries with 50% probability:
status <- c("perfectly cooked","burnt")
original_data$tagged_fries_status <- sample(status, number_fries_selected_in_first_flip, replace=TRUE, prob=c(0.5, 0.5))
If I could finish this, I could extend the simulation to the second flip, third flip, etc. At the end of the simulation (e.g. after 5 flips), I could make a chart of the number of french fries that are burnt vs. perfectly cooked. Then, I could repeat the simulation many times (e.g. 1000 times) and find the average number of fries burnt/perfectly cooked.
Could someone please show me how to write this simulation?
Thank you!

Make a data frame with 100 fries and the cumulative time spent on the first side. Since the time on the second side is just the total time minus the time on the first side, you can apply any test you want to the final "time state".
dfrm <- data.frame(id = 1:100, time = 0)
set.seed(123) # good practice in simulation
# start
dfrm$time <- sample(c(0, 60), 100, replace = TRUE)
# next minute: add to the cumulative time on the first side
dfrm$time <- dfrm$time + sample(c(0, 60), 100, replace = TRUE)
# fries with exactly 60 seconds on the first side got one minute on each side
sum(dfrm$time == 60)
You would have an exponentially decaying proportion of fries with time equal to either zero or the total time, which would be the group that is "not done". Since the simulation is "discrete", there is a time after which the probability of any fry remaining undone is less than any specified amount, but a guaranteed time is problematic: at every shake there is a nonzero chance that a given fry has still never changed sides.
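Putting this together into a fuller answer to the question, here is a minimal sketch of a simulation that follows the flip model stated in the question (each shake tosses a fry with probability 0.5, and a tossed fry then lands face down with probability 0.5). The function name simulate_fries and the parameter names are my own, not part of the question or the answer above.
set.seed(42)

# a minimal sketch: track how long each side of each fry spends face down
simulate_fries <- function(n_fries = 100, n_minutes = 2,
                           p_toss = 0.5, p_land_down = 0.5) {
  side_a_down <- rep(TRUE, n_fries)   # every fry starts with side A facing down
  time_a <- numeric(n_fries)          # seconds cooked on side A
  time_b <- numeric(n_fries)          # seconds cooked on side B
  for (m in seq_len(n_minutes)) {
    # cook for one minute on whichever side is currently down
    time_a <- time_a + ifelse(side_a_down, 60, 0)
    time_b <- time_b + ifelse(side_a_down, 0, 60)
    # shake the pan: each fry is tossed with probability p_toss,
    # and a tossed fry lands with side A down with probability p_land_down
    tossed <- runif(n_fries) < p_toss
    side_a_down[tossed] <- runif(sum(tossed)) < p_land_down
  }
  data.frame(id = seq_len(n_fries), time_a = time_a, time_b = time_b)
}

fries <- simulate_fries(n_minutes = 2)
sum(fries$time_a == 60 & fries$time_b == 60)   # perfectly cooked
sum(fries$time_a > 60  | fries$time_b > 60)    # burnt (over a minute on some side)
sum(fries$time_a == 0  | fries$time_b == 0)    # still raw on one side

# repeat 1000 times and average the number of perfectly cooked fries
mean(replicate(1000, with(simulate_fries(n_minutes = 2), sum(time_a == 60 & time_b == 60))))
Under the stated probabilities a fry ends its first shake on the other side only if it was tossed and landed flipped, i.e. with probability 0.5 × 0.5 = 0.25, so after 2 minutes roughly 25 of the 100 fries come out perfectly cooked and the rest are overcooked on their first side. And because each shake leaves a fry on the same side with probability 0.75, the chance that some fry has never turned over decays geometrically but never reaches zero, which is why no finite time is guaranteed.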

Related

Convert results into a dataframe from function

From these results:
library(stm)
labelTopics(gadarianFit, n = 15)
Topic 1 Top Words:
Highest Prob: immigr, illeg, legal, border, will, need, worri, work, countri, mexico, life, better, nation, make, worker
FREX: border, mexico, mexican, need, concern, fine, make, better, worri, nation, deport, worker, will, econom, poor
Lift: cross, racism, happen, other, continu, concern, deport, mexican, build, fine, econom, border, often, societi, amount
Score: immigr, border, need, will, mexico, illeg, mexican, worri, concern, legal, nation, fine, worker, better, also
Topic 2 Top Words:
Highest Prob: job, illeg, tax, pay, american, take, care, welfar, crime, system, secur, social, health, cost, servic
FREX: cost, health, servic, welfar, increas, loss, school, healthcar, job, care, medic, crime, social, violenc, educ
Lift: violenc, expens, opportun, cost, healthcar, loss, increas, gang, servic, medic, health, diseas, terror, school, lose
Score: job, welfar, crime, cost, tax, care, servic, increas, health, pay, school, loss, medic, healthcar, social
Topic 3 Top Words:
Highest Prob: peopl, come, countri, think, get, english, mani, live, citizen, learn, way, becom, speak, work, money
FREX: english, get, come, mani, back, becom, like, think, new, send, right, way, just, live, peopl
Lift: anyth, send, still, just, receiv, deserv, back, new, english, mani, get, busi, year, equal, come
Score: think, peopl, come, get, english, countri, mani, speak, way, send, back, money, becom, learn, live
How is it possible to keep the results from "Highest Prob" in a data frame with the number of columns equal to the number of topics and the number of rows equal to the number of words per topic (n = 15)?
Example of expected output:
topic1 topic2 topic3
immigr job peopl
illeg illeg come
In the labelTopics object, words are stored under prob. So you could try something like this:
library(stm)
topics <- labelTopics(gadarianFit, n=15)
topics <- data.frame(t(topics$prob))
colnames(topics) <- paste0("topic", 1:ncol(topics))
topics
#> topic1 topic2 topic3
#> 1 immigr job peopl
#> 2 illeg illeg come
#> 3 legal tax countri
#> 4 border pay think
#> 5 will american get
#> 6 need take english
#> 7 worri care mani
#> 8 work welfar live
#> 9 countri crime citizen
#> 10 mexico system learn
#> 11 life secur way
#> 12 better social becom
#> 13 nation health speak
#> 14 make cost work
#> 15 worker servic money
Note that stm offers several ways of selecting the most important words per topic, including "FREX" and "Lift". You would simply have to change the prob in my code to use those.
Type this to see them:
topics <- labelTopics(gadarianFit, n=15)
str(topics)
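For example, to pull the FREX words instead (the component names in the labelTopics object are lowercase, so frex rather than FREX):
topics <- labelTopics(gadarianFit, n = 15)
frex_topics <- data.frame(t(topics$frex))
colnames(frex_topics) <- paste0("topic", 1:ncol(frex_topics))
frex_topics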

Naive Bayes model NOT predicting anything on applying model- Predict function returning with 0 factor level

My dataset looks like the following. I followed a "Classification using Naive Bayes" tutorial to develop my Naive Bayes model for text mining; however, I cannot get predictions even though the model is built. The predict function returns a factor with 0 levels. Below are my dataset and code so far.
Dataset:
lie sentiment review
f n 'Mike\'s Pizza High Point NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.'
f n 'i really like this buffet restaurant in Marshall street. they have a lot of selection of american japanese and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.'
f n 'After I went shopping with some of my friend we went to DODO restaurant for dinner. I found worm in one of the dishes .'
f n 'Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it and the waitor had no manners whatsoever. Don\'t go to the Olive Oil Garden. '
f n 'The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well never more. '
f n 'I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\'t acknowledge the coupon. When I asked her about it she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is otherwise my favorite place to dine. '
f n 'I went to ABC restaurant two days ago and I hated the food and the service. We were kept waiting for over an hour just to get seated and once we ordered our food came out cold. I ordered the pasta and it was terrible - completely bland and very unappatizing. I definitely would not recommend going there especially if you\'re in a hurry!'
f n 'I went to the Chilis on Erie Blvd and had the worst meal of my life. We arrived and waited 5 minutes for a hostess and then were seated by a waiter who was obviously in a terrible mood. We order drinks and it took them 15 minutes to bring us both the wrong beers which were barely cold. Then we order an appetizer and wait 25 minutes for cold southwest egg rolls at which point we just paid and left. Don\'t go.'
f n 'OMG. This restaurant is horrible. The receptionist did not greet us we just stood there and waited for five minutes. The food came late and served not warm. Me and my pet ordered a bowl of salad and a cheese pizza. The salad was not fresh the crust of a pizza was so hard like plastics. My dog didn\'t even eat that pizza. I hate this place!!!!!!!!!!'
> dput(head(lie))
structure(list(lie = c("f", "f", "f", "f", "f", "f"), sentiment = c("n",
"n", "n", "n", "n", "n"), review = c("Mike\\'s Pizza High Point, NY Service was very slow and the quality was low. You would think they would know at least how to make good pizza, not. Stick to pre-made dishes like stuffed pasta or a salad. You should consider dining else where.",
"i really like this buffet restaurant in Marshall street. they have a lot of selection of american, japanese, and chinese dishes. we also got a free drink and free refill. there are also different kinds of dessert. the staff is very friendly. it is also quite cheap compared with the other restaurant in syracuse area. i will definitely coming back here.",
"After I went shopping with some of my friend, we went to DODO restaurant for dinner. I found worm in one of the dishes .",
"Olive Oil Garden was very disappointing. I expect good food and good service (at least!!) when I go out to eat. The meal was cold when we got it, and the waitor had no manners whatsoever. Don\\'t go to the Olive Oil Garden. ",
"The Seven Heaven restaurant was never known for a superior service but what we experienced last week was a disaster. The waiter would not notice us until we asked him 4 times to bring us the menu. The food was not exceptional either. It took them though 2 minutes to bring us a check after they spotted we finished eating and are not ordering more. Well, never more. ",
"I went to XYZ restaurant and had a terrible experience. I had a YELP Free Appetizer coupon which could be applied upon checking in to the restaurant. The person serving us was very rude and didn\\'t acknowledge the coupon. When I asked her about it, she rudely replied back saying she had already applied it. Then I inquired about the free salad that they serve. She rudely said that you have to order the main course to get that. Overall, I had a bad experience as I had taken my family to that restaurant for the first time and I had high hopes from the restaurant which is, otherwise, my favorite place to dine. "
)), .Names = c("lie", "sentiment", "review"), class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000180788>)
R code:
library(data.table)  # fread()
library(tm)          # Corpus(), tm_map(), DocumentTermMatrix(), findFreqTerms()
library(e1071)       # naiveBayes()
library(gmodels)
lie <- fread('deception.csv', header = T, fill = T, quote = "\'")
str(lie)
lie
#Corpus Building
words.vec<- VectorSource(lie$review)
words.corpus<- Corpus(words.vec)
words.corpus<-tm_map(words.corpus,content_transformer(tolower)) #lower case
words.corpus<-tm_map(words.corpus,removePunctuation) # remove punctuation
words.corpus<-tm_map(words.corpus,removeNumbers) # remove numbers
words.corpus<-tm_map(words.corpus,removeWords,stopwords('english')) # remove stopwords
words.corpus<-tm_map(words.corpus,stripWhitespace) # remove unnecessary whitespace
#==========================================================================
#Document term Matrix
dtm<-DocumentTermMatrix(words.corpus)
dtm
class(dtm)
#dtm_df<-as.data.frame(as.matrix(dtm))
#class(dtm_df)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq,decreasing=TRUE)
freq[head(ord)]
freq[tail(ord)]
#===========================================================================
#Data frame partition
#Splitting DTM
dtm_train <- dtm[1:61, ]
dtm_test <- dtm[62:92, ]
train_labels <- lie[1:61, ]$lie
test_labels <-lie[62:92, ]$lie
str(train_labels)
str(test_labels)
prop.table(table(train_labels))
prop.table(table(test_labels))
freq_words <- findFreqTerms(dtm_train, 10)
freq_words
dtm_freq_train<- dtm_train[ , freq_words]
dtm_freq_test <- dtm_test[ , freq_words]
dtm_freq_test
convert_counts <- function(x) {
x <- ifelse(x > 0, 'yes','No')
}
train <- apply(dtm_freq_train, MARGIN = 2, convert_counts)
test <- apply(dtm_freq_test, MARGIN = 2, convert_counts)
str(test)
nb_classifier<-naiveBayes(train,train_labels)
nb_classifier
test_pred<-predict(nb_classifier,test)
Thanks in advance for the help.
Naive Bayes requires the response variable to be a categorical class variable.
Convert the lie column of your lie data frame to a factor and re-run the analysis:
lie$lie <- as.factor(lie$lie)
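A minimal sketch of that fix in context, reusing the train/test objects built above:
lie$lie <- as.factor(lie$lie)        # the response must be a factor for naiveBayes()

train_labels <- lie[1:61, ]$lie
test_labels  <- lie[62:92, ]$lie

nb_classifier <- naiveBayes(train, train_labels)
test_pred <- predict(nb_classifier, test)

table(test_pred, test_labels)        # or CrossTable(test_pred, test_labels) from gmodels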

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with ngrams, using the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
But when I run this I get the following error message:
error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <- c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimentr or exploratory package caused the conflict. I reloaded my packages without these and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- data_frame(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns, but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?
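A quick way to check is to ask whether any column is a list-column:
sapply(president_tweets, is.list)   # any TRUE here points at the offending column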

row wise count the number of the words in a review text in an R dataframe

I want to count the number of words in each row:
Review_ID Review_Date Review_Content Listing_Title Star Hotel_Name
1 1/25/2016 I booked both the Crosby and Four Seasons but decided to cancel the Four Seasons closer to the arrival date based on reviews. Glad I did. The Crosby is an outstanding hotel. The rooms are immaculate and luxurious, with real attention to detail and none of the bland furnishings you find in even the top chain hotels. Staff on the whole were extremely attentive and seemed to enjoy being there. Breakfast was superb and facilities at ground level gave an intimate and exclusive feel to the hotel. It's a fairly expensive place to stay but is one of those hotels where you feel you're getting what you pay for, helped by an excellent location. Hope to be back! Outstanding 5 Crosby Street Hotel
2 1/18/2016 We've stayed many times at the Crosby Street Hotel and always have an incredible, flawless experience! The staff couldn't be more accommodating, the housekeeping is immaculate, the location's awesome and the rooms are the coolest combination of luxury and chic. During our most recent trip over The New Years holiday, we stayed in the stunning Crosby Suite which has the most extraordinary, gorgeous decor. The Crosby remains our absolute favorite in NYC. Can't wait to return! Always perfect! 5 Crosby Street Hotel
I was thinking something like:
WordFreqRowWise %>%
  rowwise() %>%
  summarise(n = n())
To get results that look something like this:
Review_ID Review_Content total_Words Min_occrd_word Max Average
1 .... 230 great: 1 the: 25 total_unique/total_words in the row
But I have no idea how I can do it.
Here is a method in base R using strsplit and sapply. Let's say the data is stored in a data.frame df and the reviews are stored in the variable Review_Content.
# break up the strings in each row by " "
temp <- strsplit(df$Review_Content, split=" ")
# count the number of words as the length of the vectors
df$wordCount <- sapply(temp, length)
In this instance, sapply will return a vector of the counts for each row.
Since the word count is now an object, you can perform analysis you want on it. Here are some examples:
summarize the distribution of word counts: summary(df$wordCount)
maximum word count: max(df$wordCount)
mean word count: mean(df$wordCount)
range of word counts: range(df$wordCount)
interquartile range of word counts: IQR(df$wordCount)
Adding to @lmo's answer above:
The code below will generate a data frame that contains all the words, row-wise, and their frequencies:
# build one long data frame of per-row word frequencies
temp2 <- data.frame()
for (i in seq_along(temp)) {
  temp1 <- as.data.frame(table(temp[[i]]))  # word frequencies for row i
  temp1$ID <- paste0("Row_", i)             # remember which row the words came from
  temp2 <- rbind(temp2, temp1)
}
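Building on the same temp object, per-row summaries close to the mock-up in the question can be added back onto df; the column names below simply mirror the question's sketch:
df$total_Words <- sapply(temp, length)
df$Average <- sapply(temp, function(w) length(unique(w)) / length(w))   # unique/total words
df$Max <- sapply(temp, function(w) {
  tab <- sort(table(w), decreasing = TRUE)
  paste0(names(tab)[1], ": ", tab[1])   # most frequent word and its count
})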

How to create a column and replace value

Question
1
An artist impression of a star system is responsible for a nova. The team from university of VYU focus on a class of compounds. The young people was seen enjoying the football match.
2
Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful. Heart attacks more due to nurture than nature. SA footballer Senzo Meyiwa shot dead to save girlfriend
Expected output
1 An artist impression of a star system is responsible for a nova.
1 The team from university of VYU focus on a class of compounds.
1 The young people was seen enjoying the football match.
2 Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful.
2 Heart attacks more due to nurture than nature.
2 SA footballer Senzo Meyiwa shot dead to save girlfriend
The data is in CSV format and has around 1000 data points; the numbers are in column (1) and the sentences are in column (2). I need to split the strings and retain the row number for each sentence. I need your help to build the R code.
Note: the number and the sentence are two different columns.
I have tried this code to split the strings, but I need code for the row index:
x$qwerty <- as.character(x$qwerty)
sa<-list(strsplit(x$qwerty,".",fixed=TRUE))[[1]]
s<-unlist(sa)
write.csv(s,"C:\\Users\\Suhas\\Desktop\\out23.csv")
One inconvenience of vectorization in R is that vectorized functions operate from "inside" the vector. That is, they operate on the elements themselves, rather than on the elements in the context of the vector. Therefore the user loses the innate ability to keep track of the index, i.e. where the element being operated on was located in the original object.
The workaround is to generate the index separately. This is easy to achieve with seq_along, which is an optimized version of 1:length(qwerty). Then you can just paste the index and the results together. In your case, you'll obviously want to do the pasting before you unlist.
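A minimal sketch of that idea, assuming the number column is called x$id (the question doesn't name it) and keeping the strsplit call from the question:
sa <- strsplit(as.character(x$qwerty), ".", fixed = TRUE)
# paste each row's id onto every sentence from that row, before flattening with unlist
res <- unlist(Map(paste, x$id, lapply(sa, trimws)), use.names = FALSE)
res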
If your dataset is as shown above, maybe this helps. You can read from the file with readLines("file.txt").
lines <- readLines(n=7)
1
An artist impression of a star system is responsible for a nova. The team from university of VYU focus on a class of compounds. The young people was seen enjoying the football match.
2
Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful. Heart attacks more due to nurture than nature. SA footballer Senzo Meyiwa shot dead to save girlfriend
lines1 <- lines[lines != '']
# split each text block into sentences at the boundary after a period
lines2 <- unlist(strsplit(lines1, '(?<=\\.)(\\b| )', perl=TRUE))
# flag the elements that are just a row number
indx <- grepl("^\\d+$", lines2)
# group each run of sentences under the preceding number and paste that number onto them
res <- unlist(lapply(split(lines2, cumsum(indx)),
                     function(x) paste(x[1], x[-1])), use.names=FALSE)
res
#[1] "1 An artist impression of a star system is responsible for a nova."
#[2] "1 The team from university of VYU focus on a class of compounds."
#[3] "1 The young people was seen enjoying the football match."
#[4] "2 Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful."
#[5] "2 Heart attacks more due to nurture than nature."
#[6] "2 SA footballer Senzo Meyiwa shot dead to save girlfriend"
If you want it as a 2-column data.frame:
dat <- data.frame(id = rep(lines2[indx], diff(c(which(indx), length(indx) + 1)) - 1),
                  Col1 = lines2[!indx], stringsAsFactors = FALSE)
head(dat,2)
# id Col1
#1 1 An artist impression of a star system is responsible for a nova.
#2 1 The team from university of VYU focus on a class of compounds.
