I am new to text mining, R and the tidy approach and am looking for kind advice to overcome a hurdle with pre-processing text strings read in from pdf files. The specific problem is with a multiple string replacement over multiple strings.
I have data from 2 sources:
PDF reports: I have used map and pdf_text functions to read a directory of pdf reports into a data frame which creates a tibble with 3 columns: page_string, filename and pagenumber. There are 1,191 entries, and page_string holds a string being one page of pdf text.
CSV file of professional words and replacements: I have used the read_csv function to import this. The resulting data frame has 2 columns with 77 entries: target_vocab (e.g. social worker) and replace_token (e.g. social_worker).
My aim is to amend the current character strings in my main data frame, replacing strings which match the professional words in target_vocab with the associated compound token in replace_token prior to tokenization.
String example - before and after string substitution:
"Social workers and early help staff work with multi-agency partners to produce child in need plans led by the allocated social worker".
"Social_workers and early_help staff work with multi_agency partners to produce CIN plans led by the allocated social_worker".
It is hopefully clear that I want "social workers", "early help", "multi-agency", "child in need" and "social worker" replaced with compound tokens.
My code:
#a bank of pdf reports and "professional_words.csv" in current working directory
library(tidyverse)
library(pdftools)
#> Using poppler version 0.73.0
library(tidytext)
library(stringr)
pdf_filenames <- list.files(pattern = "pdf$")
words_df <- read_csv("professional_words.csv", skip = 1, col_names = c("target_vocab", "replace_token"))
pattern_vector <- words_df$target_vocab
replacement_vector <- words_df$replace_token
pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
mutate(filename = .x, pagenumber = row_number()) %>%
mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector)))
The bit that doesn't work within the map function is:
mutate(page_string = str_replace_all(page_string,pattern_vector,replace_vector)))
I have tried all sorts of variations, including gsub, breaking it away from the pipe to a separate map function etc. but with my limited knowledge I am not fixing it.
I have consistently had the warning:
In stri_replace_all_regex(string, pattern,
fix_replacement(replacement), : longer object length is not a
multiple of shorter object length
With this variation of code I am also getting the error:
Problem with mutate() input page_string. x Input
page_string can't be recycled to size 10. ℹ Input page_string is
str_replace_all(page_string, pattern = pattern_vector, replacement = replace_vector). ℹ Input page_string must be size 10 or 1, not 77.
My sense is that map or list functions will help me but I seem to be going round in circles and I haven't yet found a Stack Overflow response that has helped me fix the problem.
There is a way to do what you want with str_replace_all from stringr. Instead of providing a pattern and a replacement, pass a named vector to pattern. Something like pattern = c("social worker" = "social_worker", "early help" = "early_help", "multi agency" = "multi_agency"). I'll start with a simple example, and then show you how to have R build that named vector from your words_df.
# Simple example
library(stringr)
string <- "The quick brown fox"
str_replace_all(string, pattern = c("brown" = "green", "fox" = "badger"))
[1] "The quick green badger"
Here is how you do it with some fake data that looks like yours, having R build the named replacement vector.
# Making the fake data
words_df <- data.frame(target = c("fox", "brown", "quick"),
replacement = c("badger", "green", "versatile"))
strings_df <- data.frame(page_string = c("The quick brown fox",
"The sad yellow fox",
"The quick old dog",
"The lazy brown dog",
"The quick happy fox"))
# Making the named replacement vector from words_df
replacements <- c(words_df$replacement)
names(replacements) <- c(words_df$target)
# Doing the replacement
library(dplyr)
strings_df %>%
mutate(new_string = str_replace_all(page_string,
pattern = replacements))
# The output
page_string new_string
1 The quick brown fox The versatile green badger
2 The sad yellow fox The sad yellow badger
3 The quick old dog The versatile old dog
4 The lazy brown dog The lazy green dog
5 The quick happy fox The versatile happy badger
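If you would rather not depend on stringr, the same idea can be sketched in base R with a loop over gsub. This is only a minimal sketch under a few assumptions: replace_terms is a hypothetical helper name, the sample rows are made up to resemble the question's CSV, and targets are treated as literal text (fixed = TRUE) rather than regular expressions. Longer targets are applied first so that an overlapping shorter term cannot pre-empt a longer phrase.

```r
# Hypothetical helper (not from stringr): apply each target/replacement pair in turn.
replace_terms <- function(strings, targets, replacements) {
  # Longest targets first, so longer phrases are replaced before shorter overlapping ones
  ord <- order(nchar(targets), decreasing = TRUE)
  for (i in ord) {
    # fixed = TRUE treats the target as literal text, not a regular expression
    strings <- gsub(targets[i], replacements[i], strings, fixed = TRUE)
  }
  strings
}

# Made-up lookup table shaped like the question's professional_words.csv
words_df <- data.frame(
  target_vocab  = c("social worker", "early help", "child in need"),
  replace_token = c("social_worker", "early_help", "CIN"),
  stringsAsFactors = FALSE
)

pages <- c("The allocated social worker leads child in need plans",
           "early help staff")

result <- replace_terms(pages, words_df$target_vocab, words_df$replace_token)
print(result)
```

Because every pair is applied to every string, the length of the lookup table no longer has to match the number of rows, which is what triggered the recycling warning in the question.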
str_replace_all does not work like that. If you provide vectors for pattern and replacement, the first pattern/replacement is applied to the first element of string and so on. See the following example:
library(stringr)
fruits <- c("one apple two", "two pears", "three bananas")
pattern_v <- c("one", "two", "three")
replace_v <- c("1", "2", "3")
str_replace_all(fruits, pattern_v, replace_v)
#> [1] "1 apple two" "2 pears" "3 bananas"
Created on 2020-08-25 by the reprex package (v0.3.0)
Note that "two" gets only replaced with "2" in the second element of string. Therefore, it doesn't work if the pattern/replacement vectors are not of the same length (or a multiple) of string:
pattern_v <- c("one", "two")
replace_v <- c("1", "2")
str_replace_all(fruits, pattern_v, replace_v)
[1] "1 apple two" "2 pears" "three bananas"
warning:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
longer object length is not a multiple of shorter object length
To circumvent this problem, you can pass a named vector for pattern:
str_replace_all(fruits, c("one" = "1", "two" = "2", "three" = "3"))
[1] "1 apple 2" "2 pears" "3 bananas"
Ben's answer gives a great way to make the creation of the vector easy. Note that the names are the patterns and the values are the replacements:
pattern_new <- c("1", "2", "3")
names(pattern_new) <- c("one", "two", "three")
str_replace_all(fruits, pattern_new)
[1] "1 apple 2" "2 pears" "3 bananas"
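One thing worth knowing about the named-vector form, shown here as a small sketch: as far as I understand it, the pairs are applied one after another, so a later pattern can match text produced by an earlier replacement. The lookup data frame and its column names (from, to) are made up for illustration; setNames() is just a base R shortcut for the two-step names<- assignment.

```r
library(stringr)

fruits <- c("one apple two", "two pears", "three bananas")

# Hypothetical lookup table: 'from' holds the patterns, 'to' the replacements
lookup <- data.frame(from = c("one", "two", "three"),
                     to   = c("1", "2", "3"),
                     stringsAsFactors = FALSE)

# Names are the patterns, values are the replacements
pattern_named <- setNames(lookup$to, lookup$from)
replaced <- str_replace_all(fruits, pattern_named)
print(replaced)

# Pairs are applied in sequence: "a" becomes "b", and that "b" then becomes "c"
chained <- str_replace_all("a", c("a" = "b", "b" = "c"))
print(chained)
```

If chained replacements would be a problem for your vocabulary, check that no replacement token contains another target phrase.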
Problem solved thanks to speedy responses and here is the working code to resolve my question for those that may be struggling in the future:
professional_terms <- c(words_df$replace_token)
names(professional_terms) <- c(words_df$target_vocab)
pdf_pages_df <- map_df(pdf_filenames, ~ tibble(page_string = pdf_text(.x)) %>%
mutate(filename = .x, pagenumber = row_number(), page_string = str_replace_all(page_string,pattern = professional_terms)))
This is my first time posting; please let me know if I'm making any beginner mistakes. In my specific case I have a vector of strings, and I want to collapse some adjacent rows. I have one vector indicating the starting position and one indicating the last element. How can I do this?
Here is some sample code and my approach that does not work:
text <- c("cat", "dog", "house", "mouse", "street")
x <- c(1,3)
y <- c(2,5)
result <- as.data.frame(paste(text[x:y],sep = " ",collapse = ""))
In case it's not clear, the result I want is a data frame consisting of two strings: "cat dog" and "house mouse street".
Not sure this is the best option, but it does the job,
sapply(mapply(seq, x, y), function(i)paste(text[i], collapse = ' '))
#[1] "cat dog" "house mouse street"
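A possible refinement, as a sketch only: mapply() simplifies its result to a matrix when every range happens to have the same length, and sapply() would then iterate over single indices instead of ranges. Map() always returns a list, so the variant below should behave the same for equal-length ranges too. The x2/y2 vectors are made up to show that case.

```r
text <- c("cat", "dog", "house", "mouse", "street")
x <- c(1, 3)
y <- c(2, 5)

# Map() always returns a list of index vectors, one per start/end pair
collapse_ranges <- function(text, from, to) {
  idx <- Map(seq, from, to)
  vapply(idx, function(i) paste(text[i], collapse = " "), character(1))
}

result <- data.frame(combined = collapse_ranges(text, x, y),
                     stringsAsFactors = FALSE)
print(result)

# Made-up equal-length ranges, where mapply(seq, ...) would return a matrix
x2 <- c(1, 3)
y2 <- c(2, 4)
equal_len <- collapse_ranges(text, x2, y2)
print(equal_len)
```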
Either use base R with
mapply(function(.x,.y) paste(text[.x:.y],collapse = " "), x, y)
or use the purrr package as
map2_chr(x,y, ~ paste(text[.x:.y],collapse = " "))
Both yield
# [1] "cat dog" "house mouse street"
The output as a data frame depends on the structure you want: rows or columns
I think you want
result <- data.frame(combined = c(paste(text[x[1]:y[1]], collapse = " "),
paste(text[x[2]:y[2]], collapse = " ")))
Which gives you
result
#> combined
#> 1 cat dog
#> 2 house mouse street
Another base R solution, using parse + eval
result <- data.frame(new = sapply(paste0(x,":",y),function(v) paste0(text[eval(parse(text = v))],collapse = " ")),
row.names = NULL)
such that
> result
new
1 cat dog
2 house mouse street
Consider the following dataset :
a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
c <- c("T", "T", "T", "F")
df <- data.frame(string1=a, string2=b, returns=c)
I'm trying to detect string1 in string2, BUT my goal is not only to detect exact matches. I'm looking for a way to detect the presence of the words of string1 in string2, whatever order the words appear in. As an example, the string "my beautiful house is cool" should return TRUE when searching for "my house".
I have tried to illustrate the expected behaviour of the script in the "returns" column of the example dataset above.
I have tried the grepl() and str_detect() functions, but they only work with exact matches. Can you please help? Thanks in advance.
The trick here is to not use str_detect as is but to first split the search_words into individual words. This is done in strsplit() below. We then pass this into str_detect to check if all words are matched.
library(stringr)
search_words <- c("my house", "green", "the cat is", "a girl")
words <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
patterns <- strsplit(search_words," ")
mapply(function(word,string) all(str_detect(word,string)),words,patterns)
One base R option without the involvement of split could be:
n_words <- lengths(regmatches(df[, 1], gregexpr(" ", df[, 1], fixed = TRUE))) + 1
n_matches <- mapply(FUN = function(x, y) lengths(regmatches(x, gregexpr(y, x))),
df[, 2],
gsub(" ", "|", df[, 1], fixed = TRUE),
USE.NAMES = FALSE)
n_matches == n_words
[1] TRUE TRUE TRUE FALSE
It, however, makes the assumption that there is at least one word per row in string1
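Both approaches above match substrings, so a search word such as "is" would also count as found inside "this". If whole-word matching is what you need, here is a minimal base R sketch that compares word sets instead. It assumes words are separated by single spaces and that case matters; contains_all_words is a hypothetical helper name.

```r
string1 <- c("my house", "green", "the cat is", "a girl")
string2 <- c("my beautiful house is cool", "the apple is green",
             "I m looking at the cat that is sleeping", "a boy")

# Hypothetical helper: TRUE when every word of 'needle' is a whole word of 'haystack'
contains_all_words <- function(needle, haystack) {
  needle_words   <- strsplit(needle, " ", fixed = TRUE)[[1]]
  haystack_words <- strsplit(haystack, " ", fixed = TRUE)[[1]]
  all(needle_words %in% haystack_words)
}

returns <- mapply(contains_all_words, string1, string2, USE.NAMES = FALSE)
print(returns)

# Substring matching would say TRUE here; whole-word matching says FALSE
partial <- contains_all_words("is", "this cat")
print(partial)
```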
I'm learning R and I'm trying to use regex to extract specific text. I would like to capture a number and the unit of measure from a recipe for a specific ingredient.
For example for the following text:
text <- c("0.5 Tb of butter","3 grams (0.75 sticks) of chilled butter","2 tbs softened butter", "0.3 Tb of milk")
I would like to extract the numbers and units relating only to butter, i.e:
0.5 Tb
3 grams
2 tbs
I think this would be best done using regex, but I'm quite new to this so I'm struggling somewhat.
Using str_match I can get the number in front of specific unit like this:
str_match(text, "\\s*(\\d+)\\s*Tb")
[,1] [,2]
[1,] "5 Tb" "5"
[2,] NA NA
[3,] NA NA
[4,] "3 Tb" "3"
But how could I get only the values that relate to butter and for a range of units. Is it possible to make a list of possible units (i.e. grams, tbs, Tb etc.) and ask to match any of them (so that in this example grams would match but not sticks)?
Or perhaps this would be done better with some loop? I could put each sentence into a data frame, loop through each row asking if there is 'butter' in the row, search for a number in it and extract the number and the word that follows, which should be the unit of measure.
Thanks for the help.
A base R solution would be to grep out the butter lines and then use read.table to parse them given that the matched items are always the first two fields. No packages are used and the only regular expression used is the simple expression butter.
butter <- grep("butter", text, value = TRUE)
read.table(text = butter, fill = TRUE, as.is = TRUE)[1:2]
giving:
V1 V2
1 0.5 Tb
2 3.0 grams
3 2.0 tbs
An option would be to detect the 'butter' in the strings and then use str_extract
str_extract(grep("butter", text, value = TRUE), "[0-9.]+\\s+\\w+")
#[1] "0.5 Tb" "3 grams" "2 tbs"
Or using str_detect with str_extract
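library(stringr)
# extract() must be magrittr's alias for `[`; with library(tidyverse) attached,
# a bare extract() resolves to tidyr::extract(), which fails on a character vector
str_detect(text, "butter") %>%
  magrittr::extract(text, .) %>%
  str_extract("[0-9.]+\\s+\\w+")
#[1] "0.5 Tb" "3 grams" "2 tbs"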
You may want to take a look at something like this ([\d.]+)\s([a-zA-Z]+).*butter
sub("^(\\S+\\s+\\S+).*", "\\1", text[grepl("butter", text)])
[1] "0.5 Tb" "3 grams" "2 tbs"
\\s+ to match any number of spaces and \\S+ to match any number of non-spaces. ^ to start at the beginning.
text[grepl("butter", text)] returns only the text elements which contain the word butter. Perhaps add the argument ignore.case = TRUE to grepl() so that it also matches Butter...
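On the question of matching only a list of allowed units: one way, sketched here in base R, is to paste the units into an alternation and require one of them after the number. The units vector is an assumption (extend it to whatever your recipes use), and the sketch takes only the first number-plus-unit pair in each butter line.

```r
text <- c("0.5 Tb of butter", "3 grams (0.75 sticks) of chilled butter",
          "2 tbs softened butter", "0.3 Tb of milk")

# Assumed list of accepted units; "sticks" is deliberately left out
units <- c("Tb", "tbs", "grams")

# A number (optionally with decimals), whitespace, then one of the units as a whole word
pattern <- paste0("[0-9]+(\\.[0-9]+)?\\s+(", paste(units, collapse = "|"), ")\\b")

# Keep only the lines mentioning butter, then pull out the first match in each
butter  <- text[grepl("butter", text, ignore.case = TRUE)]
amounts <- regmatches(butter, regexpr(pattern, butter, perl = TRUE))
print(amounts)
```

Note that regmatches() silently drops lines with no match, so amounts can be shorter than butter if a line uses a unit that is not in the list.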
I need to remove all apostrophes from my data frame but as soon as I use....
textDataL <- gsub("'","",textDataL)
The data frame gets ruined and the new data frame only contains values and NAs, when I am only looking to remove any apostrophes from any text that might be in there? Am I missing something obvious with apostrophes and data frames?
To keep the structure intact:
dat1 <- data.frame(Col1= c("a woman's hat", "the boss's wife", "Mrs. Chang's house", "Mr Cool"),
Col2= c("the class's hours", "Mr. Jones' golf clubs", "the canvas's size", "Texas' weather"),
stringsAsFactors=F)
I would use
dat1[] <- lapply(dat1, gsub, pattern="'", replacement="")
or
library(stringr)
dat1[] <- lapply(dat1, str_replace_all, "'","")
dat1
# Col1 Col2
# 1 a womans hat the classs hours
# 2 the bosss wife Mr. Jones golf clubs
# 3 Mrs. Changs house the canvass size
# 4 Mr Cool Texas weather
You don't want to apply gsub directly to a data frame, but column-wise instead. Note that apply() returns a character matrix rather than a data frame, e.g.
apply(textDataL, 2, gsub, pattern = "'", replacement = "")
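If the data frame also holds numeric columns, a variant that touches only the character columns keeps the other types intact. This is a sketch with a made-up data frame (dat, with columns name and n), not the asker's textDataL.

```r
# Made-up data: one text column and one integer column
dat <- data.frame(name = c("a woman's hat", "Texas' weather"),
                  n    = c(1L, 2L),
                  stringsAsFactors = FALSE)

# Find the character columns and clean only those
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], gsub,
                      pattern = "'", replacement = "", fixed = TRUE)
print(dat)
```

Assigning into dat[is_chr] keeps the object a data frame, and the integer column is never coerced to text.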
I am trying to parse a character string into its parts, check whether each of the parts exists in a separate vocabulary, and later re-assemble only those strings whose parts are all in the vocabulary. The vocabulary is a vector of words, and is created separately from the strings I want to compare. The final goal is to create a data frame with only those strings whose word parts are in the vocabulary.
I have written a piece of code to parse out the data into strings, but cannot figure out how to make the comparison. If you believe that parsing out the data is not the optimal solution, please let me know.
Here is an example:
Assume that I have three character strings:
"The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue"
and my vocabulary consists of the words:
cat, **the**, **elephant**, hippo,
**in**, run, **is**, bike,
walk, **room, is, blue, cannot**
In this case I will pick only the first and third strings, because each of their word parts are matched to a word in my vocabulary. I will not select the second string, because the words "dog" and "swim" are not in the vocabulary.
Thank you!
Per request, attached is the code I have written so far to clean the strings, and parse them into unique words:
animals <- c("The elephant in the room is blue", "The dog cannot swim", "The cat is blue")
animals2 <- toupper(animals)
animals2 <- gsub("[[:punct:]]", " ", animals2)
animals2 <- gsub("(^ +)|( +$)|( +)", " ", animals2)
## Parse the characters and select unique words only
animals2 <- unlist(strsplit(animals2," "))
animals2 <- unique(animals2)
Here is how I would do it:
Read the data
Clean vocab to remove extra spaces and *
Loop over the strings, using setdiff
My code is:
## read your data
tt <- c("The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue")
vocab <- scan(textConnection('cat, **the**, **elephant**, hippo,
**in**, run, **is**, bike,
walk, **room, is, blue, cannot**'),sep=',',what='char')
## polish vocab
vocab <- gsub('\\s+|[*]+','',vocab)
vocab <- vocab[nchar(vocab) >0]
##
sapply(tt,function(x){
  x.words <- tolower(unlist(strsplit(x,' '))) ## take lower (the==The)
  length(setdiff(x.words ,vocab)) ==0
})
The elephant in the room is blue The dog cannot swim The cat is blue
TRUE FALSE TRUE
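To get from that logical vector to the data frame mentioned as the final goal, something along these lines might do. It is a sketch that assumes the vocabulary has already been cleaned to plain lower-case words and that words are separated by whitespace; the column name string is arbitrary.

```r
strings <- c("The elephant in the room is blue",
             "The dog cannot swim",
             "The cat is blue")

# Vocabulary from the question, already stripped of asterisks and duplicates
vocab <- c("cat", "the", "elephant", "hippo", "in", "run", "is", "bike",
           "walk", "room", "blue", "cannot")

# TRUE for a string when every one of its (lower-cased) words is in the vocabulary
in_vocab <- vapply(strsplit(tolower(strings), "\\s+"),
                   function(words) all(words %in% vocab),
                   logical(1))

# Keep the original strings, untouched, for the rows that passed
kept <- data.frame(string = strings[in_vocab], stringsAsFactors = FALSE)
print(kept)
```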