I would like to count the number of English words in a string of text.
df.words <- data.frame(ID = 1:2,
text = c(c("frog friend fresh frink foot"),
c("get give gint gobble")))
df.words
ID text
1 1 frog friend fresh frink foot
2 2 get give gint gobble
I'd like the final product to look like this:
ID text count
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
I'm guessing I'll have to first separate based on spaces and then reference the words against a dictionary?
Building on #r2evans suggestion of using strsplit() and using a random English word .txt file dictionary online, example is below. This solution probably might not scale well if you have a large number of comparisons because of the unnest step.
library(dplyr)
library(tidyr)
# text file with 479k English words ~4MB
dict <- read.table(file = url("https://github.com/dwyl/english-words/raw/master/words_alpha.txt"), col.names = "text2")
df.words <- data.frame(ID = 1:2,
text = c(c("frog friend fresh frink foot"),
c("get give gint gobble")),
stringsAsFactors = FALSE)
df.words %>%
mutate(text2 = strsplit(text, split = "\\s")) %>%
unnest(text2) %>%
semi_join(dict, by = c("text2")) %>%
group_by(ID, text) %>%
summarise(count = length(text2))
Output
ID text count
<int> <chr> <int>
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
Base R alternative, using EJJ's great recommendation for dict:
sapply(strsplit(df.words$text, "\\s+"),
function(z) sum(z %in% dict$text2))
# [1] 4 3
I thought that this would be a clear winner in speed, but apparently doing sum(. %in% .) one at a time can be a little expensive. (It is slower with this data.)
Faster but not necessarily simpler:
words <- strsplit(df.words$text, "\\s+")
words <- sapply(words, `length<-`, max(lengths(words)))
found <- array(words %in% dict$text2, dim = dim(words))
colSums(found)
# [1] 4 3
It's a hair faster (~ 10-15%) than EJJ's solution, so likely only a good thing if you need to wring some performance out of it.
(Caveat: EJJ's is faster with this 2-row dataset. If the data is 1000x larger, then my first solution is a little faster, and my second solution is twice as fast. Benchmarks are benchmarks, though, don't optimize code beyond usability if speed/time is not a critical factor.)
Related
I created this function to count the maximum number of consecutive characters in a word.
max(rle(unlist(strsplit("happy", split = "")))$lengths)
The function works on individual words, but when I try to use the function within a mutate step it doesn't work. Here is the code that involves the mutate step.
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
num_vowels = get_count(word),
num_consec_char = max(rle(unlist(strsplit(word, split = "")))$lengths)
)
The variables num_letters and num_vowels work fine, but I get a 2 for every value of num_consec_char. I can't figure out what I'm doing wrong.
This command rle(unlist(strsplit(word, split = "")))$lengths is not vectorized and thus is operating on the entire list of words for each row thus the same result for each row.
You will need to use some type of loop (ie for, apply, purrr::map) to solve it.
library(dplyr)
library(tidytext)
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
output<- text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
# num_vowels = get_count(word),
)
output$num_consec_char<- sapply(output$word, function(word){
max(rle(unlist(strsplit(word, split = "")))$lengths)
})
output
# A tibble: 32 × 4
line word num_letters num_consec_char
<int> <chr> <int> <int>
1 1 the 3 1
2 1 most 4 1
3 1 pressing 8 2
4 1 of 2 1
5 1 those 5 1
6 1 issues 6 2
7 1 considering 11 1
I want to find the distribution of number of titles with 1 word, 2 words, 3 words, ... in my dataset "jnl.dt" in R.
one_word_title = 0
two_word_title = 0
three_word_title = 0
for (i in 1:x){
if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==1){one_word_title <- one_word_title+1}
else if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==2){two_word_title <- two_word_title+1}
else if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==3){three_word_title <- three_word_title+1}
}
one_word_title
two_word_title
three_word_title
Is there a way to find the distribution of number of titles with different number of words without hardcoding the number of words in title?
Instead of doing this for every word separately, you can do this together.
table(stringr::str_count(jnl.dt$`Full Title`, '\\w+'))
Here's a proposal somewhat tentative given the absence of reproducible data:
Let's assume you have this kind of data and titles:
df <- data.frame(titles = c("The Great Gatsby", "That's the Story of my Life", "Love Story", "Alice in Wonderland", "Harry Potter"))
To get the "distribution" of number of words in the titlesyou can do this:
library(dplyr)
library(stringr)
df %>%
mutate(N_w = str_count(titles, "\\S+")) %>%
group_by(N_w) %>%
summarise(Dist_N_w = n())
# A tibble: 3 x 2
N_w Dist_N_w
* <int> <int>
1 2 2
2 3 2
3 6 1
Note that using \\w+ and, respectively, \\S+ makes a difference: as the apostrophe is not contained in the \\w character class (for letter, digits, and the underscore) That's will be counted as 2 words. If you use \\S instead, which is a negative character class matching anything that is a whitespace (including actual whitespace and also new line and return characters etc.), the count for That's will be 1.
We may use unnest_tokens
library(tidytext)
library(dplyr)
df %>%
mutate(rn = row_number()) %>%
unnest_tokens(word, titles) %>%
count(rn) %>%
count(n)
I am very new to NLP. Please, don't judge me strictly.
I have got a very big data-frame on customers' feedback, my goal is to analyze feedbacks. I tokenized words in feedbacks, deleted stop-words (SMART). Now, I need to receive a table of most and less frequent used words.
The code looks like this:
library(tokenizers)
library(stopwords)
words_as_tokens <-
tokenize_words(dat$description,
stopwords = stopwords(language = "en", source = "smart"))
The dataframe looks like this: there are lots of feedbacks (variable "description") and customers by whom the feedbacks were given (each customer is not unique, they can be repeated). I want to receive a table with 3 columns: a) customer name b) word c) its frequency. This "ranking" should be in a decreasing order.
Try this
library(tokenizers)
library(stopwords)
library(tidyverse)
# count freq of words
words_as_tokens <- setNames(lapply(sapply(dat$description,
tokenize_words,
stopwords = stopwords(language = "en", source = "smart")),
function(x) as.data.frame(sort(table(x), TRUE), stringsAsFactors = F)), dat$name)
# tidyverse's job
df <- words_as_tokens %>%
bind_rows(, .id = "name") %>%
rename(word = x)
# output
df
# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
Data
dat <- data.frame(name = c("John", "Alex"),
description = c("Unprecedented. The perfect word to describe Amazon. In every positive sense of that word! All because of one man - Jeff Bezos. What an entrepreneur! What a vision! This is from personal experience. Let me explain. I had given up all hope, after a horrible experience with Amazon.ae (formerly Souq.com) - due to a Herculean effort to get an order cancelled and the subsequent refund issued. I have never faced such a feedback-resistant team in my life! They were robotically answering my calls and sending me monotonous, unhelpful emails, followed by absolutely zero action!",
"Not only does Amazon have great products but their Customer Service for the most part is wonderful. Although most times you are outsourced to a different country, I personally have found that when I call it's either South Africa or Philippines and they speak so well, understand me and my NY accent and are quite nice. Let’s face it. Most times you are calling CS with a problem or issue. These agents have to listen to 8 hours of complaints so they themselves need a break. No matter how annoyed I am I try to be on my best behavior and as nice as can be because they too need a break with how nasty we as a society can be."), stringsAsFactors = F)
You can try with quanteda as well as follows:
library(quanteda)
library(quanteda.textstats)
# define a corpus object to store your initial documents
mycorpus = corpus(dat$description)
# convert the corpus to a Document-Feature Matrix
mydfm = dfm( mycorpus,
tolower = TRUE,
remove = stopwords(), # this removes English stopwords
remove_punct = TRUE, # this removes punctuation
remove_numbers = TRUE, # this removes digits
remove_symbol = TRUE, # this removes symbols
remove_url = TRUE ) # this removes urls
# calculate word frequencies and return a data.frame
word_frequencies = textstat_frequency( mydfm )
I am currently working with clinical assessment data that is scored and output by a software package in a .txt file. My goal is extract the data from the txt file into a long format data frame with a column for: Participant # (which is included in the file name), subtest, Score, and T-score.
An example data file is available here:
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
I am running into a couple road blocks that I could use some input into how navigate.
1) I only need the information that corresponds to each subtest, these all have a number prior to the subtest name. Therefore, the rows that only have one to two words that are not necessary (eg cognitive screen) seem to be interfering creating new data frames because I have a mismatch in columns provided and columns wanted.
Some additional corks to the data:
1) the asteriks are NOT necessary
2) the cognitive TOTAL will never have a value
I am utilizing the readtext package to import the data at the moment and I am able to get a data frame with two columns. One being the file name (this includes the participant name) so that problem is fixed. However, the next column is a a giant character string with the columns data points for both Score and T-Score. Presumably I would then need to split these into the columns of interest, previously listed.
Next problem, when I view the data the T scores are in the correct order, however the "score" data no longer matches the true values.
Here is what I have tried:
# install.packages("readtext")
library(readtext)
library(tidyr)
pathTofile <- path.expand("/Users/Brahma/Desktop/CAT TEXT FILES/")
data <- readtext(paste0(pathTofile2, "CAToutput.txt"),
#docvarsfrom = "filenames",
dvsep = " ")
From here I do not know how to split the data, in my head I would do something like this
data2 <- separate(data2, text, sep = " ", into = c("subtest", "score", "t_score"))
This of course, gives the correct column names but removes almost all the data I actually am interested in.
Any help would be appreciated whether a solution or a direction you might suggest I look for more answers.
Sincerely,
Alex
Here is a way of converting that text file to a dataframe that you can do analysis on
library(tidyverse)
input <- read_lines('c:/temp/scores.txt')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
mutate(row = row_number()) %>%
fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
table <- left_join(scores,
header,
by = 'row'
)
colnames(table) <- c('index', 'type', 'Score', 'T-Score', 'row', 'title')
head(table, 10)
# A tibble: 10 x 6
index type Score `T-Score` row title
<chr> <chr> <chr> <chr> <int> <chr>
1 "1. " Line Bisection 9 53 3 Subtest/Section
2 "2. " Semantic Memory 8 51 4 Subtest/Section
3 "3. " Word Fluency 1 56* 5 Subtest/Section
4 "4. " Recognition Memory 40 59 6 Subtest/Section
5 "5. " Gesture Object Use 2 68 7 Subtest/Section
6 "6. " Arithmetic 5 49 8 Subtest/Section
7 "7. " Spoken Words 17 45* 14 Spoken Language
8 "9. " Spoken Sentences 25 53* 15 Spoken Language
9 "11. " Spoken Paragraphs 4 60 16 Spoken Language
10 "8. " Written Words 14 45* 20 Written Language
What is the source for the code at the link provided?
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
This data is odd. I was able to successfully match patterns and manipulate most of the data, but two rows refused to oblige. Rows 17 and 20 refused to be matched. In addition, the data type / data structure are very unfamiliar.
This is what was accomplished before hitting a wall.
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, Extract = str_extract(df$V2, "[1-9]+\\s[1-9]+\\*+\\s?"))
df2 <- df1 %>% mutate(V2, Extract2 = str_extract(df1$V2, "[0-9]+.[0-9]+$"))
head(df2)
When the data was further explored, the second column, V2, included data types that are completely unfamiliar. These included: Arithmetic, Complex Words, Digit Strings, and Function Words.
If anything, it would good to know something about those unfamiliar data types.
Took another look at this problem and found where it had gotten off track. Ignore my previous post. This solution works in Jupyter Lab using the data that was provided.
library(stringr)
library(dplyr)
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, "Score" = str_extract(df$V2, "\\d+") )
df2 <- df1 %>% mutate(V2, "T Score" = str_extract(df$V2, "\\d\\d\\*?$"))
df3 <- df2 %>% mutate(V2, "Subtest/Section" = str_remove_all(df2$V2, "\\\t+[0-9]+"))
df4 <- df3 %>% mutate(V1, "Sub-S" = str_extract(df3$V1, "\\s\\d\\d\\s*"))
df5 <- df4 %>% mutate(V1, "Sub-T" = str_extract(df4$V1,"\\d\\d\\*"))
df6 <- replace(df5, is.na(df5), "")
df7 <- df6 %>% mutate(V1, "Description" = str_remove_all(V1, "\\d\\d\\s\\d\\d\\**$")) # remove digits, new variable
df7$V1 <- NULL # remove variable
df7$V2 <- NULL # remove variable
df8 <- df7[, c(6,3,1,4,2,5)] # re-align variables
head(df8,15)
I have a simple problem, consider this example
library(dplyr)
library(stringr)
dataframe <- data_frame(mytext = c('stackoverflow is pretty good my friend',
'but sometimes pretty bad as well'))
# A tibble: 2 x 1
mytext
<chr>
1 stackoverflow is pretty good my friend
2 but sometimes pretty bad as well
I want to count the number of times stackoverflow is near good. I use the following regex but it does not work.
dataframe %>% mutate(mycount = str_count(mytext,
regex('stackoverflow(?:\\w+){0,5}good', ignore_case = TRUE)))
# A tibble: 2 x 2
mytext mycount
<chr> <int>
1 stackoverflow is pretty good my friend 0
2 but sometimes pretty bad as well 0
Can someone tell me what am I missing here?
Thanks!
I had a bunch of trouble with this too and I'm still not sure why the things I was trying didn't work. But I'm only decent at regular expressions, not an expert. However, I was able to get it to work with lookback and lookforward.
library(dplyr)
library(stringr)
dataframe <- data_frame(mytext = c('stackoverflow is pretty good my friend',
'but sometimes pretty bad as well',
'stackoverflow one two three four five six good',
'stackoverflow good'))
dataframe
dataframe %>% mutate(mycount = str_count(mytext,
regex('(?<=stackoverflow)\\s(?:\\w+\\s){0,5}(?=good)', ignore_case = TRUE)))
## A tibble: 4 x 2
# mytext mycount
# <chr> <int>
#1 stackoverflow is pretty good my friend 1
#2 but sometimes pretty bad as well 0
#3 stackoverflow one two three four five six good 0
#4 stackoverflow good 1
The corpus library makes this pretty easy:
library(corpus)
dataframe <- data.frame(mytext = c('stackoverflow is pretty good my friend',
'but sometimes pretty bad as well'))
# find instances of 'stackoverflow'
loc <- text_locate(dataframe$mytext, "stackoverflow")
# count the number of times 'good' is within 5 tokens
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
| text_detect(text_sub(loc$after, 1, 4), "good"))
# aggregate over text
count <- tapply(near_good, loc$text, sum, default = 0)
Conceptually, corpus treats text as a sequence of tokens. The library allows you to index these sequences using the text_sub() command. You can also change the definition of a token using a text_filter().
Here's an example that works the same way but ignores punctuation-only tokens:
corpus <- corpus_frame(text = c("Stackoverflow, is pretty (?) GOOD my friend!",
"But sometimes pretty bad as well"))
text_filter(corpus)$drop_punct <- TRUE
loc <- text_locate(corpus, "stackoverflow")
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
| text_detect(text_sub(loc$after, 1, 4), "good"))
count <- tapply(near_good, loc$text, sum, default = 0)
I think I got it
dataframe %>%
mutate(mycount = str_count(mytext,
regex('stackoverflow\\W+(?:\\w+ ){0,5}good', ignore_case = TRUE)))
# A tibble: 4 x 2
mytext mycount
<chr> <int>
1 stackoverflow is pretty good my friend 1
2 but sometimes pretty bad as well 0
3 stackoverflow good good stackoverflow 1
4 stackoverflowgood 0
The key was adding the \W+ meta-character that matches anything between words.