Count unique words in a string using dplyr (R) - r

Let's say I have a string as follows:
string <- "the home home on the range the friend"
All I want to do is determine which words in the string appear at least 2 times.
The pseudocode here is:
Count how many times each word appears
Return the list of words that appear at least twice in the string
Final result should be a list featuring both the and home, in that order.
I am hoping to do this using the tidyverse, ideally with stringr or dplyr. Was attempting to use tidytext as well but have been struggling.

We can split the string by space, get the table and subset based on frequency
out <- table(strsplit(string, "\\s+")[[1]])
out[out >=2]
home the
2 3
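Note that the table above is ordered alphabetically, so "home" comes before "the". A minimal base R sketch, assuming the order wanted in the question is by descending frequency, sorts the filtered table before taking its names (the variable names `words`, `counts`, `repeated` and `result` are illustrative only):

```r
string <- "the home home on the range the friend"

# split on whitespace and count each word
words <- strsplit(string, "\\s+")[[1]]
counts <- table(words)

# keep words seen at least twice, most frequent first
repeated <- sort(counts[counts >= 2], decreasing = TRUE)
result <- names(repeated)
result
#> [1] "the"  "home"
```

If the order of first appearance is wanted instead, `unique(words[words %in% names(repeated)])` would be one way to get it.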

Yet another possible solution:
library(tidyverse)
data.frame(x = str_split(string, "\\s+", simplify = T) %>% t) %>%
add_count(x) %>%
filter(n >= 2) %>%
distinct %>%
pull(x)
#> [1] "the" "home"

library(tidyverse)
data.frame(string) %>%
separate_rows(string) %>%
count(string, sort = TRUE) %>%
filter(n >= 2)
Result
# A tibble: 2 × 2
string n
<chr> <int>
1 the 3
2 home 2

Here's an approach using quanteda that prints "the" before "home" as requested in the original post.
library(quanteda)
aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))
# extract the features where the count > 1
aDfm@Dimnames$features[aDfm@x > 1]
...and the output:
> aDfm@Dimnames$features[aDfm@x > 1]
[1] "the" "home"

Here is another option using tidytext and tidyverse, where we first separate each word (unnest_tokens), then we can count each word and sort by frequency. Then, we keep only words that have more than 1 observation, then use tibble::deframe to return a named vector.
library(tidytext)
library(tidyverse)
tibble(string) %>%
unnest_tokens(word, string) %>%
count(word, sort = TRUE) %>%
filter(n >= 2) %>%
deframe()
Output
the home
3 2
Or, if you want to keep the result as a data frame, simply omit the final deframe() step.

Related

Extract date after string in R

I'm trying to extract dates from a Notes column using tidyr's extract function. The data I'm working on looks like this:
dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
End_date = NA,
Start_Date = NA )
I tried extracting the date following the string "S/DATE" like this:
extract <- extract(
dates,
col = "Notes",
into = "Start_date",
regex = "(?<=(S\\/DATE:)).*" # Using regex lookbehind
)
However, this only extracts the string "S/DATE:", not the date after it. When I tried this on regex101.com, it works as expected.
Thanks. Ibrahim
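A likely explanation, offered as a sketch rather than a definitive diagnosis: tidyr's extract() fills the `into` columns from the capture groups of the regex, and the only group in the pattern above sits inside the lookbehind, so the label is what gets captured. Assuming the date is the first run of non-whitespace characters after the label, moving the parentheses onto the date looks like this (the trimmed-down `dates` data frame and the name `res` are for illustration):

```r
library(tidyr)

dates <- data.frame(
  col1 = c("customer", "customer2", "customer3"),
  Notes = c("DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020",
            "\nS/DATE: 28/08/19\nR/DATE: 27/08/20",
            "DOB: 13/01/1980\nStart:04/12/2018")
)

# the single capture group now wraps the date, not the label
res <- extract(
  dates,
  col = Notes,
  into = "Start_date",
  regex = "S/DATE:\\s*(\\S+)",
  remove = FALSE   # keep the original Notes column
)

# rows without an S/DATE label get NA
res$Start_date
```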
You could use sub here for a base R option:
s_date <- ifelse(grepl("S/DATE", dates$Notes),
sub("^.*\\bS/DATE: (\\S+).*$", "\\1", dates$Notes), NA)
s_date
[1] NA "28/08/19" NA
Note that the call to grepl above is needed because sub, by default, returns the entire input string (in this case the full Notes) when S/DATE is not found in the text.
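To see why the guard matters, here is a small sketch on two made-up notes (the vector `notes` and the names `unguarded` and `guarded` are hypothetical, not part of the original data):

```r
notes <- c("START: 09/01/2019", "S/DATE: 28/08/19")
pattern <- "^.*\\bS/DATE: (\\S+).*$"

# without the guard, the first element has no match and comes back unchanged
unguarded <- sub(pattern, "\\1", notes)
unguarded
# first element is still "START: 09/01/2019", second is "28/08/19"

# with the guard, elements lacking the label become NA
guarded <- ifelse(grepl("S/DATE", notes), unguarded, NA)
guarded
# first element is NA, second is "28/08/19"
```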
Another method is shown below. (I am assuming you need either S/DATE or START, since your expected new column is named Start_date. If you do not need all such values, the syntax is easy to modify.)
Explanation -
In the innermost expression, the Notes column is split into a list on any character in the class [: | \n], that is, a colon, a space, a pipe or a newline.
Blank elements are then removed from this list.
In the modified list, the item following Start or S/DATE is extracted using sapply, which simplifies the list into a vector (if possible).
Lastly, lubridate::dmy is applied in the outermost expression.
sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])
[1] "09/01/2019" "28/08/19" "04/12/2018"
If you wrap the above in lubridate::dmy, the dates will be parsed into proper Date values as well
dmy(sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))]))
[1] "2019-01-09" "2019-08-28" "2018-12-04"
Further, this can be passed into dplyr pipes, so as to simultaneously create a new column in your dates
dates %>% mutate(Start_Date = dmy(sapply(strsplit(Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])))
col1 Notes End_date Start_Date
1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
I would combine stringr and lubridate:
dates %>%
mutate(
Start_Date =
sub("\ns/date:", "\nstart:", tolower(Notes)) %>%
str_remove_all("(.*\nstart:)|(\n.*)") %>%
trimws() %>%
lubridate::dmy()
)
# col1 Notes End_date Start_Date
# 1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
# 2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
# 3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
This answer is not as concise, but I find the steps intuitive and easy to follow.
First I replace one start pattern with the other (sub), after using tolower to convert everything to lowercase. Then I remove everything before the start date and everything after the following line break (str_remove_all). Finally I trim whitespace (trimws) and convert the result to a date (lubridate::dmy).
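The steps can be traced on a single note. This is only a sketch: it swaps in base R's gsub with perl = TRUE for str_remove_all (assuming that, as in stringr, `.` should not match a newline) and as.Date with a two-digit-year format for lubridate::dmy, so it does not handle the four-digit years that appear in other rows.

```r
note <- "\nS/DATE: 28/08/19\nR/DATE: 27/08/20"

# 1. lowercase everything and rename the s/date label to start
step1 <- sub("\ns/date:", "\nstart:", tolower(note))

# 2. drop everything up to the start label, and everything from the next newline on
step2 <- gsub("(.*\nstart:)|(\n.*)", "", step1, perl = TRUE)

# 3. trim the leading space left behind
step3 <- trimws(step2)

# 4. parse as day/month/two-digit year
start_date <- as.Date(step3, format = "%d/%m/%y")
start_date
#> [1] "2019-08-28"
```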
Another approach is splitting the text and dealing with smaller chunks.
Step by step illustration, with one row of data
# Split the text on newlines, yielding dates with labels
dates$Notes %>% head(1) %>% strsplit("\n")
[[1]]
[1] "DOB: 12/10/62" "START: 09/01/2019" "END: 09/01/2020"
Drilling down to the next level
# Split each name/value pair on colons
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*")
[[1]]
[1] "DOB" "12/10/62"
[[2]]
[1] "START" "09/01/2019"
[[3]]
[1] "END" "09/01/2020"
Extract the individual values
# extract a vector of name labels
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[1])
[1] "DOB" "START" "END"
# extract a vector of associated values
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[2])
[1] "12/10/62" "09/01/2019" "09/01/2020"
With some clever dplyr usage, you'll get a data frame
dates %>%
group_by(col1) %>%
# summarize here returns several rows per group; in dplyr 1.1.0 and later
# this usage is deprecated in favor of reframe()
summarize(
name = Notes %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[1]),
value = Notes %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[2])
) %>%
ungroup()
Result, all of the values separated and ready for further processing
# A tibble: 8 x 3
col1 name value
<chr> <chr> <chr>
1 customer DOB 12/10/62
2 customer START 09/01/2019
3 customer END 09/01/2020
4 customer2 NA NA
5 customer2 S/DATE 28/08/19
6 customer2 R/DATE 27/08/20
7 customer3 DOB 13/01/1980
8 customer3 Start 04/12/2018
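As one possible next step, here is a base R sketch that picks the start dates out of the long table. It assumes, as in the answers above, that both START and S/DATE mark a start date; the data frame `long` simply retypes the rows printed above (minus the empty NA row), and the names `is_start` and `starts` are illustrative.

```r
long <- data.frame(
  col1  = c("customer", "customer", "customer",
            "customer2", "customer2", "customer3", "customer3"),
  name  = c("DOB", "START", "END", "S/DATE", "R/DATE", "DOB", "Start"),
  value = c("12/10/62", "09/01/2019", "09/01/2020",
            "28/08/19", "27/08/20", "13/01/1980", "04/12/2018")
)

# compare labels case-insensitively against the two start markers
is_start <- toupper(long$name) %in% c("START", "S/DATE")
starts <- long[is_start, c("col1", "value")]
starts$value
# one start date string per customer, still unparsed
```

The values are left as strings here; lubridate::dmy, as used in the other answers, could then parse them.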

R - Count exact matches in string from list of words, then calculate overall sentiment using score per word

I have a dataset containing a column of strings from which I wish to calculate an overall sentiment score, and a data frame containing all the unique words that appear in the strings, each of which is assigned a score:
library(stringr)
df <- data.frame(text = c('recommend good value no problem','terrible quality no good','good service excellent quality commend'), score = 0)
words <- c('recommend','good','value','problem','terrible','quality','service','excellent','commend')
scores <- c(1,2,1,-2,-3,1,0,3,1)
wordsdf <- data.frame(words,scores)
The only way I have been able to get close to this is by using a nested for loop and the str_count function from the stringr package:
for (i in 1:3){
count = 0
for (j in 1:9){
count <- count + (str_count(df$text[i],as.character(wordsdf$words[j])) * wordsdf$scores[j])
}
df$score[i] <- count
}
This almost achieves what I want:
text score
1 recommend good value no problem 3
2 terrible quality no good 0
3 good service excellent quality commend 7
However, since the word 'commend' is also contained in the word 'recommend', my code calculates the scores as if both words are contained in the string.
So I have two queries:
1 - Is there a way to get it to match only to exact words?
2 - Is there a way to achieve this without using the nested loop?
One tidyverse possibility could be:
library(tidyverse)
df %>%
rowid_to_column() %>%
mutate(text = strsplit(text, " ", fixed = TRUE)) %>%
unnest(text) %>%
full_join(wordsdf, by = c("text" = "words")) %>%
group_by(rowid) %>%
summarise(text = paste(text, collapse = " "),
scores = sum(scores, na.rm = TRUE)) %>%
ungroup() %>%
select(-rowid)
text scores
<chr> <dbl>
1 recommend good value no problem 2
2 terrible quality no good 0
3 good service excellent quality commend 7
First, it splits the "text" column into separate words. Second, it performs a full join of these words against the word scores. Finally, it pastes the words of each "text" back together and sums their scores.
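The same idea can be sketched in base R without a join. Because each text is split into whole words before the lookup, 'commend' can no longer match inside 'recommend', and no nested loop is needed. The names `score_lookup`, `tokens` and `totals` are illustrative, and the sketch assumes words are separated by single spaces.

```r
text <- c("recommend good value no problem",
          "terrible quality no good",
          "good service excellent quality commend")
words <- c("recommend", "good", "value", "problem", "terrible",
           "quality", "service", "excellent", "commend")
scores <- c(1, 2, 1, -2, -3, 1, 0, 3, 1)

# named vector: look up a score by the exact word
score_lookup <- setNames(scores, words)

# split each text into whole words
tokens <- strsplit(text, " ", fixed = TRUE)

# words without a score (such as "no") look up as NA and are ignored
totals <- vapply(tokens,
                 function(w) sum(score_lookup[w], na.rm = TRUE),
                 numeric(1))
totals
#> [1] 2 0 7
```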

Count number of times a word appears (dplyr)

Simple question here, perhaps a duplicate of this?
I'm trying to figure out how to count the number of times a word appears in a vector. I know I can count the number of rows a word appears in, as shown here:
temp <- tibble(idvar = 1:3,
response = (c("This sounds great",
"This is a great idea that sounds great",
"What a great idea")))
temp %>% count(grepl("great", response)) # lots of ways to do this line
# answer = 3
The answer in the code above is 3 since "great" appears in three rows. However, the word "great" appears 4 different times in the vector "response". How do I find that instead?
We could use str_count from stringr to get the number of instances having 'great' in each row and then get the sum of that count
library(tidyverse)
temp %>%
mutate(n = str_count(response, 'great')) %>%
summarise(n = sum(n))
# A tibble: 1 x 1
# n
# <int>
#1 4
Or using regmatches/gregexpr from base R
sum(lengths(regmatches(temp$response, gregexpr('great', temp$response))))
#[1] 4
Off the top of my head, this should solve your problem:
library(tidyverse)
temp$response %>%
str_extract_all('great') %>%
unlist %>%
length
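One caveat with all three answers: the pattern 'great' is a plain substring match, so it would also count the "great" inside a word such as "greater". A minimal sketch, assuming whole-word matches are what is wanted, adds word boundaries to the base R version (the fourth response is a hypothetical row added to show the difference):

```r
response <- c("This sounds great",
              "This is a great idea that sounds great",
              "What a great idea",
              "A greater concern")   # hypothetical extra row

# plain substring match also counts the "great" inside "greater"
loose <- sum(lengths(regmatches(response, gregexpr("great", response))))

# \\b restricts the match to whole words
strict <- sum(lengths(regmatches(
  response, gregexpr("\\bgreat\\b", response, perl = TRUE))))

c(loose = loose, strict = strict)
# loose is 5, strict is 4
```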

Count specific length of letter of the volume of words

I am trying to find the words in a sentence that have more than 4 letters.
I tried this:
fullsetence <- as.character(c("A test setence with test length","A second test for length"))
nchar(fullsetence)
As a result, I expect one count per sentence: in the example above, the first sentence has 2 words longer than 4 letters and the second also has 2.
Using nchar, I only get the total number of characters in each string.
What is the right way to do this?
library(dplyr)
library(purrr)
# vector of sentences
fullsetence <- as.character(c("A test setence with test length","A second test for length"))
# get vector of counts for words with more than 4 letters
fullsetence %>%
strsplit(" ") %>%
map(~sum(nchar(.) > 4)) %>%
unlist()
# [1] 2 2
# create a dataframe with sentence and the corresponding counts
# use previous code as a function within "mutate"
data.frame(fullsetence, stringsAsFactors = F) %>%
mutate(Counts = fullsetence %>%
strsplit(" ") %>%
map(~sum(nchar(.) > 4)) %>%
unlist() )
# fullsetence Counts
# 1 A test setence with test length 2
# 2 A second test for length 2
If you want to get the actual words with more than 4 letters you can use this in a similar way:
fullsetence %>%
strsplit(" ") %>%
map(~ .[nchar(.) > 4])
data.frame(fullsetence, stringsAsFactors = F) %>%
mutate(Words = fullsetence %>%
strsplit(" ") %>%
map(~ .[nchar(.) > 4]))
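For comparison, the counting step can also be sketched in base R alone, with vapply in place of map and unlist. This assumes, like the code above, that words are separated by single spaces; the name `counts` is illustrative.

```r
fullsetence <- c("A test setence with test length",
                 "A second test for length")

# for each sentence, count the words with more than 4 characters
counts <- vapply(strsplit(fullsetence, " "),
                 function(w) sum(nchar(w) > 4),
                 integer(1))
counts
#> [1] 2 2
```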

R: How to best extract two XML attributes from a node?

The following code extracts one attribute (or all) from an XML file:
library(xml2);library(magrittr);library(readr);library(tibble);library(knitr)
fname<-'https://raw.githubusercontent.com/wardblonde/ODM-to-i2b2/master/odm/examples/CDISC_ODM_example_3.xml'
fname
x<-read_xml(fname)
xpath="//d1:ItemDef"
itemsNames <- x %>% xml_find_all(xpath, ns=xml_ns(x)) %>% xml_attr('Name')
items <- x %>% xml_find_all(xpath, ns=xml_ns(x))
Item looks like this:
<ItemDef OID="IT.ABNORM" Name="Normal/Abnormal/Not Done" DataType="integer" Length="1" ...
Sample file can be viewed here: https://raw.githubusercontent.com/wardblonde/ODM-to-i2b2/master/odm/examples/CDISC_ODM_example_3.xml
Using pipes and xml_attr, what is the best way to extract both the Name and DataType attributes and have them rbinded?
Ideally it would be a single line of super efficient piped code. I can extract names and types and have 'data.frame(name=names,type=types)' but that seems not the best and most modern.
The result should be a tibble with columns name and data type.
library(purrr)
map(items, xml_attrs) %>%
map_df(as.list) %>%
select(Name, DataType)
## # A tibble: 94 × 2
## Name DataType
## <chr> <chr>
## 1 Normal/Abnormal/Not Done integer
## 2 Actions taken re study drug text
## 3 Actions taken, other text
## 4 Stop Day - Enter Two Digits 01-31 text
## 5 Derived Stop Date text
## 6 Stop Month - Enter Two Digits 01-12 text
## 7 Stop Year - Enter Four Digit Year text
## 8 Outcome text
## 9 Relationship to study drug text
## 10 Severity text
## # ... with 84 more rows
One "base" version:
lapply(items, xml_attrs) %>%
lapply(function(x) as.data.frame(as.list(x))[,c("Name", "DataType")]) %>%
do.call(rbind, .) %>%
tbl_df()
NOTE: an issue with ^^ is that if Name or DataType is missing then you're SOL. You can mitigate that with:
lapply(items, xml_attrs) %>%
lapply(function(x) as.data.frame(as.list(x))[,c("Name", "DataType")]) %>%
data.table::rbindlist(fill=TRUE) %>%
tbl_df()
or:
lapply(items, xml_attrs) %>%
lapply(function(x) as.data.frame(as.list(x))[,c("Name", "DataType")]) %>%
bind_rows() %>%
tbl_df()
if you don't like purrr.
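Another option, sketched here on a small made-up document rather than the linked file, is to call xml_attr once per attribute. xml_attr returns NA for a node that lacks the attribute, so a missing DataType does not break the result. The inline XML (which, unlike the real file, has no namespace) and the name `result` are assumptions for illustration.

```r
library(xml2)

# hypothetical two-item document; the second item has no DataType
doc <- read_xml('<ODM>
  <ItemDef OID="IT.A" Name="Normal/Abnormal/Not Done" DataType="integer"/>
  <ItemDef OID="IT.B" Name="Outcome"/>
</ODM>')

items <- xml_find_all(doc, "//ItemDef")

# one column per attribute; a missing attribute becomes NA
result <- data.frame(
  Name = xml_attr(items, "Name"),
  DataType = xml_attr(items, "DataType")
)
result
```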
