finding distribution of words in R - r

I want to find the distribution of number of titles with 1 word, 2 words, 3 words, ... in my dataset "jnl.dt" in R.
one_word_title = 0
two_word_title = 0
three_word_title = 0
for (i in 1:x){
if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==1){one_word_title <- one_word_title+1}
else if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==2){two_word_title <- two_word_title+1}
else if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==3){three_word_title <- three_word_title+1}
}
one_word_title
two_word_title
three_word_title
Is there a way to find the distribution of number of titles with different number of words without hardcoding the number of words in title?

Instead of doing this for every word separately, you can do this together.
table(stringr::str_count(jnl.dt$`Full Title`, '\\w+'))

Here's a proposal somewhat tentative given the absence of reproducible data:
Let's assume you have this kind of data and titles:
df <- data.frame(titles = c("The Great Gatsby", "That's the Story of my Life", "Love Story", "Alice in Wonderland", "Harry Potter"))
To get the "distribution" of number of words in the titlesyou can do this:
library(dplyr)
library(stringr)
df %>%
mutate(N_w = str_count(titles, "\\S+")) %>%
group_by(N_w) %>%
summarise(Dist_N_w = n())
# A tibble: 3 x 2
N_w Dist_N_w
* <int> <int>
1 2 2
2 3 2
3 6 1
Note that using \\w+ and, respectively, \\S+ makes a difference: as the apostrophe is not contained in the \\w character class (for letter, digits, and the underscore) That's will be counted as 2 words. If you use \\S instead, which is a negative character class matching anything that is a whitespace (including actual whitespace and also new line and return characters etc.), the count for That's will be 1.

We may use unnest_tokens
library(tidytext)
library(dplyr)
df %>%
mutate(rn = row_number()) %>%
unnest_tokens(word, titles) %>%
count(rn) %>%
count(n)

Related

counting word frequency in a string across columns in R

I am trying to get a count of how many times each word appears total for every index of a column for my whole data set. The data can be found here:https://www.kaggle.com/tovarischsukhov/southparklines
My code is as follows:
SP = read.csv("All-seasons.csv")
SP$Season = as.numeric(SP$Season)
SP$Episode = as.numeric(SP$Episode)
Cartman = SP %>% group_by(Character) %>%
arrange(Season, Episode) %>%
filter(Character =="Cartman")
Cartman_text_tbl <- as_tibble(data.frame(uniqueID = 1:length(Cartman$Season),Cartman[1:length(Cartman$Season),]))
Cartman_text_tbl_words <- Cartman_text_tbl %>% select(uniqueID,Cartman$Line) %>%
unnest_tokens(word, Cartman$Line) %>% filter(str_detect(word,"^[a-z']+$")) %>%
group_by(uniqueID) %>% count(word)
When I run the last line of code I get this error:
Error in `select()`:
! Can't subset columns that don't exist.
x Columns `Yeah, go home you little dildo.\n`, `I know what it means!\n`, `I'm not telling you.\n`, `He-yeah, that's what Kyle's little brother is all right! Ow! \n`, `That's 'cause I was having these... bogus nightmares.\n`, etc. don't exist.
I did a project for a class a couple of years ago where the professor provided some similar code, I am trying to format this code off what was previously provided for me. If there is a better way to get a count that would be awesome to know about as well, otherwise a way to fix the error would be great. Additionally, each line ends with a "\n" I was wondering if its possible to remove those from every column? Thanks!
If I understand you correctly, I believe this may help you. The output gives you the count of each word said by Cartman for each episode and season. Of course for other characters you can use the same code and change the filter and object the output is assigned to. Also if you need to remove stop words you can add anti_join(stop_words, by = "word") %>% after the unnest_tokens() function. It is also set as sort = TRUE, so it will sort the words in descending order based on frequency, so you can change this and sort as needed.
Code:
library(tidyverse)
library(tidytext)
df <- read_csv("All-seasons.csv")
cartman <- df %>%
filter(Character == "Cartman") %>%
group_by(Season, Episode) %>%
unnest_tokens(output = word, input = Line) %>%
count(word, sort = TRUE)
Output Example:
> head(cartman)
# A tibble: 6 x 4
# Groups: Season, Episode [6]
Season Episode word n
<dbl> <dbl> <chr> <int>
1 7 11 you 73
2 11 8 i 73
3 5 4 you 66
4 16 7 you 63
5 14 8 i 61
6 11 2 i 60

Having difficulty using rle command within a mutate step in r to count the max number of consecutive characters in a word

I created this function to count the maximum number of consecutive characters in a word.
max(rle(unlist(strsplit("happy", split = "")))$lengths)
The function works on individual words, but when I try to use the function within a mutate step it doesn't work. Here is the code that involves the mutate step.
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
num_vowels = get_count(word),
num_consec_char = max(rle(unlist(strsplit(word, split = "")))$lengths)
)
The variables num_letters and num_vowels work fine, but I get a 2 for every value of num_consec_char. I can't figure out what I'm doing wrong.
This command rle(unlist(strsplit(word, split = "")))$lengths is not vectorized and thus is operating on the entire list of words for each row thus the same result for each row.
You will need to use some type of loop (ie for, apply, purrr::map) to solve it.
library(dplyr)
library(tidytext)
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
output<- text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
# num_vowels = get_count(word),
)
output$num_consec_char<- sapply(output$word, function(word){
max(rle(unlist(strsplit(word, split = "")))$lengths)
})
output
# A tibble: 32 × 4
line word num_letters num_consec_char
<int> <chr> <int> <int>
1 1 the 3 1
2 1 most 4 1
3 1 pressing 8 2
4 1 of 2 1
5 1 those 5 1
6 1 issues 6 2
7 1 considering 11 1

Count number of English words in string in R

I would like to count the number of English words in a string of text.
df.words <- data.frame(ID = 1:2,
text = c(c("frog friend fresh frink foot"),
c("get give gint gobble")))
df.words
ID text
1 1 frog friend fresh frink foot
2 2 get give gint gobble
I'd like the final product to look like this:
ID text count
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
I'm guessing I'll have to first separate based on spaces and then reference the words against a dictionary?
Building on #r2evans suggestion of using strsplit() and using a random English word .txt file dictionary online, example is below. This solution probably might not scale well if you have a large number of comparisons because of the unnest step.
library(dplyr)
library(tidyr)
# text file with 479k English words ~4MB
dict <- read.table(file = url("https://github.com/dwyl/english-words/raw/master/words_alpha.txt"), col.names = "text2")
df.words <- data.frame(ID = 1:2,
text = c(c("frog friend fresh frink foot"),
c("get give gint gobble")),
stringsAsFactors = FALSE)
df.words %>%
mutate(text2 = strsplit(text, split = "\\s")) %>%
unnest(text2) %>%
semi_join(dict, by = c("text2")) %>%
group_by(ID, text) %>%
summarise(count = length(text2))
Output
ID text count
<int> <chr> <int>
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
Base R alternative, using EJJ's great recommendation for dict:
sapply(strsplit(df.words$text, "\\s+"),
function(z) sum(z %in% dict$text2))
# [1] 4 3
I thought that this would be a clear winner in speed, but apparently doing sum(. %in% .) one at a time can be a little expensive. (It is slower with this data.)
Faster but not necessarily simpler:
words <- strsplit(df.words$text, "\\s+")
words <- sapply(words, `length<-`, max(lengths(words)))
found <- array(words %in% dict$text2, dim = dim(words))
colSums(found)
# [1] 4 3
It's a hair faster (~ 10-15%) than EJJ's solution, so likely only a good thing if you need to wring some performance out of it.
(Caveat: EJJ's is faster with this 2-row dataset. If the data is 1000x larger, then my first solution is a little faster, and my second solution is twice as fast. Benchmarks are benchmarks, though, don't optimize code beyond usability if speed/time is not a critical factor.)

Regex's are working, but code looks horrible

I'm cleaning up a long list of noun-phrases for further text mining. They're supposed to be 1- or 2-word phrases, but some have / in a conjunction. Here's what I've got:
library(tidyverse)
conjuncts <- tibble(usecase = 1:3,
classes = c("Insulators/Insulation",
"Optic/light fiber",
"Magnets"))
And I want:
wanted <- tibble(usecase = c(1,1,2,2,3),
classes = c("Insulators/Insulation",
"Insulators/Insulation",
"Optic/light fiber",
"Optic/light fiber",
"Magnets"),
bigrams = c("Insulators", "Insulation",
"Optic fiber", "Light fiber", NA))
I've got something working, but it's horrible and non-extensible.
patternSplit <- function(class){
regexs <- c("(?x) ^ (\\w+) / (\\w+) $",
"(?x) ^ (\\w+) / (\\w+) \\s+ (\\w+) $")
if(str_detect(class, regexs[1])){
extr <- str_match(class, regexs[1])
list(extr[1,2],
extr[1,3])
} else if(str_detect(class, regexs[2])){
extr <- str_match(class, regexs[2])
list(paste(extr[1,2], extr[1,4]),
paste(extr[1,3], extr[1,4]))
} else {
list(NA_character_)
}
}
anx <- conjuncts %>%
mutate(bigrams = map(classes, patternSplit)) %>%
unnest(cols = "bigrams") %>%
unnest(cols = "bigrams")
Which gives me what I wanted, but blecchh!
# A tibble: 5 x 3
usecase classes bigrams
<int> <chr> <chr>
1 1 Insulators/Insulation Insulators
2 1 Insulators/Insulation Insulation
3 2 Optic/light fiber Optic fiber
4 2 Optic/light fiber light fiber
5 3 Magnets NA
The top two problems (1) I have to run the rexex twice - once with str_detect to get the logical for the if / else and again with str_match to pull out the tokens. (2) I have do the double unnest to unwind the list structure. And smaller problem (3) Can I get out of if / else, into case_when or switch?
I'll eventually be extending this to about a dozen patterns and use-cases.
Here is solution using / as seperator to detect word phrases, then using ifelse to get desired result:
patternSplit<- function(x,p="[A-z]+[/][A-z]+"){
x1<- stringr::str_extract(x,p)
x2<- stringr::str_replace(x,p,"")
return(cbind(val1=x1,val2=x2))
}
conjuncts<- cbind(conjuncts,conjuncts$classes %>% patternSplit()) %>%
tidyr::separate_rows(val1, sep = '/') %>%
dplyr::mutate(bigrams= ifelse(!is.na(val1),paste0(val1,val2),val1)) %>%
dplyr::select(-contains("val"))
conjuncts
usecase classes bigrams
1 1 Insulators/Insulation Insulators
2 1 Insulators/Insulation Insulation
3 2 Optic/light fiber Optic fiber
4 2 Optic/light fiber light fiber
5 3 Magnets <NA>

.TXT in long form to data.frame in wide form in R

I am currently working with clinical assessment data that is scored and output by a software package in a .txt file. My goal is extract the data from the txt file into a long format data frame with a column for: Participant # (which is included in the file name), subtest, Score, and T-score.
An example data file is available here:
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
I am running into a couple road blocks that I could use some input into how navigate.
1) I only need the information that corresponds to each subtest, these all have a number prior to the subtest name. Therefore, the rows that only have one to two words that are not necessary (eg cognitive screen) seem to be interfering creating new data frames because I have a mismatch in columns provided and columns wanted.
Some additional corks to the data:
1) the asteriks are NOT necessary
2) the cognitive TOTAL will never have a value
I am utilizing the readtext package to import the data at the moment and I am able to get a data frame with two columns. One being the file name (this includes the participant name) so that problem is fixed. However, the next column is a a giant character string with the columns data points for both Score and T-Score. Presumably I would then need to split these into the columns of interest, previously listed.
Next problem, when I view the data the T scores are in the correct order, however the "score" data no longer matches the true values.
Here is what I have tried:
# install.packages("readtext")
library(readtext)
library(tidyr)
pathTofile <- path.expand("/Users/Brahma/Desktop/CAT TEXT FILES/")
data <- readtext(paste0(pathTofile2, "CAToutput.txt"),
#docvarsfrom = "filenames",
dvsep = " ")
From here I do not know how to split the data, in my head I would do something like this
data2 <- separate(data2, text, sep = " ", into = c("subtest", "score", "t_score"))
This of course, gives the correct column names but removes almost all the data I actually am interested in.
Any help would be appreciated whether a solution or a direction you might suggest I look for more answers.
Sincerely,
Alex
Here is a way of converting that text file to a dataframe that you can do analysis on
library(tidyverse)
input <- read_lines('c:/temp/scores.txt')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
mutate(row = row_number()) %>%
fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
table <- left_join(scores,
header,
by = 'row'
)
colnames(table) <- c('index', 'type', 'Score', 'T-Score', 'row', 'title')
head(table, 10)
# A tibble: 10 x 6
index type Score `T-Score` row title
<chr> <chr> <chr> <chr> <int> <chr>
1 "1. " Line Bisection 9 53 3 Subtest/Section
2 "2. " Semantic Memory 8 51 4 Subtest/Section
3 "3. " Word Fluency 1 56* 5 Subtest/Section
4 "4. " Recognition Memory 40 59 6 Subtest/Section
5 "5. " Gesture Object Use 2 68 7 Subtest/Section
6 "6. " Arithmetic 5 49 8 Subtest/Section
7 "7. " Spoken Words 17 45* 14 Spoken Language
8 "9. " Spoken Sentences 25 53* 15 Spoken Language
9 "11. " Spoken Paragraphs 4 60 16 Spoken Language
10 "8. " Written Words 14 45* 20 Written Language
What is the source for the code at the link provided?
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
This data is odd. I was able to successfully match patterns and manipulate most of the data, but two rows refused to oblige. Rows 17 and 20 refused to be matched. In addition, the data type / data structure are very unfamiliar.
This is what was accomplished before hitting a wall.
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, Extract = str_extract(df$V2, "[1-9]+\\s[1-9]+\\*+\\s?"))
df2 <- df1 %>% mutate(V2, Extract2 = str_extract(df1$V2, "[0-9]+.[0-9]+$"))
head(df2)
When the data was further explored, the second column, V2, included data types that are completely unfamiliar. These included: Arithmetic, Complex Words, Digit Strings, and Function Words.
If anything, it would good to know something about those unfamiliar data types.
Took another look at this problem and found where it had gotten off track. Ignore my previous post. This solution works in Jupyter Lab using the data that was provided.
library(stringr)
library(dplyr)
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, "Score" = str_extract(df$V2, "\\d+") )
df2 <- df1 %>% mutate(V2, "T Score" = str_extract(df$V2, "\\d\\d\\*?$"))
df3 <- df2 %>% mutate(V2, "Subtest/Section" = str_remove_all(df2$V2, "\\\t+[0-9]+"))
df4 <- df3 %>% mutate(V1, "Sub-S" = str_extract(df3$V1, "\\s\\d\\d\\s*"))
df5 <- df4 %>% mutate(V1, "Sub-T" = str_extract(df4$V1,"\\d\\d\\*"))
df6 <- replace(df5, is.na(df5), "")
df7 <- df6 %>% mutate(V1, "Description" = str_remove_all(V1, "\\d\\d\\s\\d\\d\\**$")) # remove digits, new variable
df7$V1 <- NULL # remove variable
df7$V2 <- NULL # remove variable
df8 <- df7[, c(6,3,1,4,2,5)] # re-align variables
head(df8,15)

Resources