How do I remove a specific term in my dataframe string? - r

df <- dataframe$Data %>%
na.omit() %>%
tolower() %>%
strsplit(split = " ") %>%
unlist() %>%
table() %>%
sort(decreasing = TRUE)
Hey guys, I'm using these functions to get a list of word frequencies (I'm working with a giant text), but I'm getting repeated words like "banana", "banana.", "banana?" etc., and they are counted separately. How do I delete the periods, question marks and other punctuation so that banana is summed correctly? Thanks!

Try using gsub with the [[:punct:]] character class to strip punctuation before counting:
df <- dataframe$Data %>%
na.omit() %>%
tolower() %>%
strsplit(split = " ") %>%
unlist() %>%
gsub('[[:punct:]]', '', .) %>%
table() %>%
sort(decreasing = TRUE)
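
To see what the gsub step does, here is a minimal base R sketch on made-up data (the vector txt is hypothetical and stands in for dataframe$Data). It also drops the empty strings that are left behind when a token consists only of punctuation, which the pipeline above would otherwise count as a "word".

```r
# Hypothetical sample text; the real input would be dataframe$Data
txt <- c("Banana banana. banana?", NA, "apple , banana!")

# Same steps as the pipeline above, written without the pipe
words <- unlist(strsplit(tolower(na.omit(txt)), split = " "))
words <- gsub("[[:punct:]]", "", words)  # strip punctuation from every token
words <- words[words != ""]              # drop tokens that were only punctuation

# Frequency table, most common word first
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

With this input, the four spellings of "banana" collapse into a single entry.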

Related

summing up the first 5 elements of a list

I have a data frame that contains a column with varying numbers of integer values. I need to take the first five of these values and sum them up. I found a way to do it for one, but can't seem to generalize it to loop through all:
Here is the code for the first element:
results$occupied[1] %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
And what does not work for all elements:
results %>%
rowwise() %>%
select(occupied) %>%
as.character() %>%
strsplit(",") %>%
as.list() %>%
unlist() %>%
head(5) %>%
as.numeric() %>%
sum()
In base R, you can do:
sapply(strsplit(results$occupied, ","), function(x) sum(as.numeric(head(x, 5))))
Or using dplyr and purrr
library(dplyr)
library(purrr)
results %>%
mutate(total_sum = map_dbl(strsplit(occupied, ","),
~sum(as.numeric(head(.x, 5)))))
Similarly, using rowwise :
results %>%
rowwise() %>%
mutate(total_sum = sum(as.numeric(head(strsplit(occupied, ",")[[1]], 5))))
We can use separate_rows to split the 'occupied' column and expand the rows, then group by row number and sum the first five elements of each group:
library(dplyr)
library(tidyr)
results %>%
mutate(rn = row_number()) %>%
separate_rows(occupied, convert = TRUE) %>%
group_by(rn) %>%
slice(seq_len(5)) %>%
summarise(total_sum = sum(occupied)) %>%
select(-rn) %>%
bind_cols(results, .)
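
As a quick check of the base R approach, here is a small sketch on a made-up results data frame (the values are hypothetical, chosen so that one row has more than five entries and the others have fewer).

```r
# Hypothetical data: comma-separated integers stored as strings
results <- data.frame(
  occupied = c("1,2,3,4,5,6,7", "10,20", "3"),
  stringsAsFactors = FALSE
)

# Split each string, keep at most the first five pieces, convert and sum
results$total_sum <- sapply(
  strsplit(results$occupied, ","),
  function(x) sum(as.numeric(head(x, 5)))
)

print(results)
```

head(x, 5) simply returns everything when a row has fewer than five values, so short rows need no special handling.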

How to calculate p.value of each column in a data frame with NA values using shapiro.test in r?

This is what I have tried so far. It works, but it only gives me the p.value for data that has no NAs. Much of my data has NA values, in places up to a third of the values.
normal <- apply(cor_phys, 2, function(x) shapiro.test(x)$p.value)
I want to try adding na.rm to the function, but it's not working. Help?
#calculate the correlations between all variables
corres <- cor_phys %>% #cor_phys is my data
as.matrix %>%
cor(use="complete.obs") %>% #complete.obs does not use NA
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1)
#removes duplicates correlations
corres <- corres %>%
mutate(var_order = paste(var1, var2) %>%
strsplit(split = ' ') %>%
map_chr( ~ sort(.x) %>%
paste(collapse = ' '))) %>%
mutate(cnt = 1) %>%
group_by(var_order) %>%
mutate(cumsum = cumsum(cnt)) %>%
filter(cumsum != 2) %>%
ungroup %>%
select(-var_order, -cnt, -cumsum) #removes unneeded columns
I did not write this myself, but it is the answer that I used and worked for my needs. The link to the page I used is: How to compute correlations between all columns in R and detect highly correlated variables
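
For the per-column p-values asked about in the question, a minimal sketch is to drop the NAs inside the function before calling shapiro.test. The data frame below is made up for illustration. As far as I know, the documentation for shapiro.test says missing values are allowed as long as between 3 and 5000 non-missing values remain, so the explicit filter mainly makes the intent visible.

```r
set.seed(1)

# Hypothetical stand-in for cor_phys, with a few missing values per column
cor_phys <- data.frame(a = rnorm(30), b = runif(30))
cor_phys$a[c(2, 5, 9)] <- NA
cor_phys$b[c(1, 10)] <- NA

# One p-value per column, computed on the non-missing values only
normal <- sapply(cor_phys, function(x) shapiro.test(x[!is.na(x)])$p.value)
print(normal)
```

sapply over the data frame works column by column and avoids the matrix conversion that apply(cor_phys, 2, ...) performs, which can matter if any column is not numeric.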

Error: `by` required, because the data sources have no common variables

I am trying to apply the code from this link to my data:
https://www.tidytextmining.com/sentiment.html#sentiment-analysis-with-inner-join
The code in the book is
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
I wrote it like the following (excluded "filter" because I have just filenames and words columns in my data)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
abc %>%
inner_join(nrc_joy ) %>%
count(word, sort = TRUE)
I get this error:
Error: by required, because the data sources have no common variables
Any ideas how to deal with it?
After running into a similar issue this is what I found.
The complete code from the website is:
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)  # provides unnest_tokens() and get_sentiments()
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
The 'abc' dataset is unspecified in the question; however, it is easy to make up a substitute dataset using a 'DifferentColumnNameForWord' column.
library(tidytext)
abc <- data.frame(DifferentColumnNameForWord = stop_words$word, stringsAsFactors = FALSE)
To find out which column of the data frame holds the words, use the 'names' function.
> names(abc)
[1] "DifferentColumnNameForWord"
Once the name of the column is identified the code would need to be modified as follows:
abc %>% inner_join(nrc_joy, by = c("DifferentColumnNameForWord" = "word")) %>%
count(DifferentColumnNameForWord, sort = TRUE)
In my situation, one dataset had the words under the 'word' column while another had the words under the 'term' column.
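
The 'word' versus 'term' situation can be sketched with two tiny made-up data frames. Both lexicon and abc below are hypothetical; a real run would use get_sentiments("nrc"), which needs the lexicon to be downloaded first.

```r
library(dplyr)

# Hypothetical lexicon with the words in a column named 'word'
lexicon <- data.frame(
  word = c("happy", "joy", "delight"),
  sentiment = "joy",
  stringsAsFactors = FALSE
)

# Hypothetical user data with the words in a column named 'term'
abc <- data.frame(
  filename = c("a.txt", "a.txt", "b.txt", "b.txt"),
  term = c("happy", "table", "joy", "happy"),
  stringsAsFactors = FALSE
)

# Left-hand name comes from abc, right-hand name from lexicon
joy_counts <- abc %>%
  inner_join(lexicon, by = c("term" = "word")) %>%
  count(term, sort = TRUE)

print(joy_counts)
```

An alternative with the same effect would be to rename the column first, for example rename(abc, word = term), after which a plain inner_join(lexicon, by = "word") works.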

R - Count with tidytext data

I'm working on text mining with some Freud books from the Gutenberg project. When I try to do a sentiment analysis, using following code:
library(dplyr)
library(tidytext)
library(gutenbergr)
freud_books <- gutenberg_download(c(14969, 15489, 34300, 35875, 35877, 38219, 41214), meta_fields = "title")
tidy_books <- freud_books %>%
unnest_tokens(word, text)
f_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing"), by = "word") %>%
count(title, index = line %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
I get the error:
Error in mutate_impl(.data, dots) : Evaluation error: non-numeric argument to binary operator.
I can see that the problem is in the last block, in the count function. Any help with this?
'line' is not a column of your data, so if you need it you have to create it yourself with mutate after the inner_join.
Pay attention to the mutate(line = row_number()) part; you can modify it if you need another way of assigning line numbers. After that you can use index = line %/% 80 in count.
Try this:
library(dplyr)
library(tidyr)  # provides spread()
library(tidytext)
library(gutenbergr)
freud_books <- gutenberg_download(c(14969, 15489, 34300, 35875, 35877, 38219, 41214),
meta_fields = "title")
tidy_books <- freud_books %>%
unnest_tokens(word, text)
f_sentiment <- tidy_books %>%
inner_join(get_sentiments("bing"), by = "word") %>%
mutate(line = row_number()) %>%
count(title, index = line %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
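
A likely reason for the "non-numeric argument" message, offered as an assumption rather than a confirmed diagnosis: when no column called line exists, R falls back to the function line() from the stats package, and a function cannot be divided. The base R sketch below shows that, and also what line %/% 80 produces once line is a numeric vector (line_no is a hypothetical stand-in for the new column).

```r
# With no 'line' column, the name resolves to a function (stats::line)
print(is.function(line))

# Integer-dividing a function fails; capture the message instead of stopping
err <- tryCatch(line %/% 80, error = function(e) conditionMessage(e))
print(err)

# With real line numbers, %/% 80 groups rows into consecutive blocks
line_no <- 1:200
index <- line_no %/% 80
print(table(index))
```

Note that row_number() after the join numbers the joined rows across all books, so the 80-row blocks are blocks of matched words, not of original text lines. Numbering the lines per title before unnest_tokens would be one way to change that.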

R tidytext stop_words are not filtering consistently from gutenbergr downloads

This is a bizarre puzzle. I downloaded 2 texts from gutenbergr - Alice in Wonderland and Ulysses.
The stop_words disappear from Alice but they are still in Ulysses.
This issue persisted even when replacing anti_join with
filter (!word %in% stop_words$word).
How do I get the stop_words out of Ulysses?
Thanks for your help!
Plot of top 15 tf_idf for Alice & Ulysses
library(gutenbergr)
library(dplyr)
library(stringr)
library(tidytext)
library(ggplot2)
titles <- c("Alice's Adventures in Wonderland", "Ulysses")
books <- gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = c("title", "author"))
data(stop_words)
tidy_books <- books %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(title, word, sort=TRUE) %>%
ungroup()
plot_tidy_books <- tidy_books %>%
bind_tf_idf(word, title, n) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
mutate(title = factor(title, levels = unique(title)))
plot_tidy_books %>%
group_by(title) %>%
arrange(desc(n))%>%
top_n(15, tf_idf) %>%
mutate(word=reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf, fill=title)) +
geom_col(show.legend = FALSE) +
labs(x=NULL, y="tf-idf") +
facet_wrap(~title, ncol=2, scales="free") +
coord_flip()
After a bit of digging in the tokenized Ulysses, the text "it's" is actually using a right single quotation mark instead of an apostrophe. stop_words in tidytext uses an apostrophe. You have to replace the right single quotation with an apostrophe.
I found this out by:
> utf8ToInt('it’s')
[1] 105 116 8217 115
Googling 8217 led me to the Unicode code point U+2019 (right single quotation mark). From there it's as easy as grabbing the C++/Java source escape \u2019 and adding a mutate and gsub statement prior to your anti_join.
tidy_books <- books %>%
unnest_tokens(word, text) %>%
mutate(word = gsub("\u2019", "'", word)) %>%
anti_join(stop_words) %>%
count(title, word, sort=TRUE) %>%
ungroup()
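
The effect of the replacement can be checked in isolation with a few made-up tokens. In this sketch stop_list is a hypothetical one-word stand-in for stop_words$word.

```r
# Hypothetical tokens: the first uses U+2019, the second a plain apostrophe
words <- c("it\u2019s", "it's", "rabbit")

# Code point of the right single quotation mark
code_point <- utf8ToInt("\u2019")
print(code_point)

# Normalise the curly quote to a plain apostrophe before filtering
normalised <- gsub("\u2019", "'", words)

# Stand-in for the stop word list, which uses plain apostrophes
stop_list <- c("it's")
kept <- normalised[!normalised %in% stop_list]
print(kept)
```

Without the gsub step, the first token would not match the stop list and would survive the filter.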
