I'm trying to clean strings in a table in sparklyr using regexp_replace. I need to remove both multiple spaces between words and specific whole words.
Establish Spark Connection
pharms <- spark_read_parquet(sc, 'pharms', 's3/path/to/pharms', infer_schema = TRUE, memory = FALSE)
Vector to clean
The column I want to clean looks like this (shown here as a local vector), but it lives inside a table on the sparklyr connection:
drug_strings <- c("tablomiacin sodium tab mg", "nsaid caps mg")
The desired output once the regex processes the data would be something like this:
Desired Outcomes
[1] "tablomiacin sodium", "nsaid"
Attempts
I've tried various combinations used in regex such as:
pharms_cln <- pharms %>%
  distinct(drug_strings) %>%
  mutate(new_strings = regexp_replace(drug_strings, "\\b(caps|mg|tab)\\b", ""))
pharms_cln <- pharms %>%
  distinct(drug_strings) %>%
  mutate(new_strings = regexp_replace(drug_strings, "\\s+", ""))
But these either replace individual letters and substrings rather than just the whole words, or throw an error related to Hive. Similarly, my attempts to remove extra spaces just seem to remove the letter 's'.
If the rule for the sought replacement is "keep anything preceding caps|mg|tab", then this may work:
Data:
drug_strings <- c("tablomiacin sodium tab mg", "nsaid caps mg")
Solution:
trimws(gsub("\\b(tab|mg|caps)\\b", "", drug_strings))
[1] "tablomiacin sodium" "nsaid"
If for some reason you need to use str_extract, you can do this:
str_extract(gsub("\\s{2,}", " ", drug_strings), "\\b\\w+\\b(\\s\\b\\w+\\b)*(?=\\s\\b(tab|mg|caps)\\b)")
This first reduces all multiple white space characters to just one such char, and then does the extraction.
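Since the question is about a table on a sparklyr connection rather than a local vector, here is a rough sketch of how the same two-step cleanup might look through dplyr on the Spark side. It is untested against a live connection; the column name drug_strings is assumed, and the backslashes are doubled up because the pattern is passed through to Spark SQL (losing a backslash on the way is the usual cause of the "removes the letter 's'" symptom):
# A sketch only, not the original answer's code
pharms_cln <- pharms %>%
  distinct(drug_strings) %>%
  # drop the unwanted whole words (backslashes doubled so \b survives the trip to Spark SQL)
  mutate(new_strings = regexp_replace(drug_strings, "\\\\b(caps|mg|tab)\\\\b", "")) %>%
  # collapse runs of spaces to one, then trim the ends
  mutate(new_strings = trim(regexp_replace(new_strings, "\\\\s+", " ")))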
Someone who knows regex could certainly streamline this code, but the following uses the str_remove() function from the stringr package.
drug_strings <- c("tablomiacin tab mg", "nsaid caps mg")
drug_strings <- data.frame(drug_strings)
drug_strings <- drug_strings %>%
mutate(new_strings=str_remove(drug_strings, "\\b(caps|mg|tab)\\b")) %>%
mutate(new_strings=str_remove(new_strings, "\\s+")) %>%
mutate(new_strings = str_remove(new_strings, "mg"))
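As a possible streamlined version of the same idea (a sketch, not the original answer's code): str_remove_all() drops every occurrence of the unwanted words in one pass, and str_squish() collapses the leftover runs of spaces.
library(dplyr)
library(stringr)

data.frame(drug_strings = c("tablomiacin tab mg", "nsaid caps mg")) %>%
  mutate(new_strings = str_squish(str_remove_all(drug_strings, "\\b(caps|mg|tab)\\b")))
# new_strings becomes "tablomiacin" and "nsaid"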
Related
I have a series of strings that have a particular set of characters. What I'd like to do is be able to extract just the word from the string with those characters in it, and discard the rest.
I've tried various regex expressions to do it but I either get it to split all the words or it returns the entire string. Following is an example of the kinds of strings. I've been trying to use stringr::str_extract_all() as there are instances where there are more than one word that needs to be pulled out.
data <- c("AlvariA?o, 1961","Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994", "A?vila & Cordeiro, 2015", "BabiA?, 1922")
result <- unlist(stringr::str_extract_all(data, "regex"))
From this I'd like a result that pulls all the words that has the "A?", like this:
result <- c("AlvariA?o", "MagaA?a", "A?vila", "BabiA"?)
It seems really simple but my regex knowledge is just not cutting it at the moment.
To match a literal ? it needs to be escaped as \\?, so A\\? will match A?. \\w matches any word character (equivalent to [a-zA-Z0-9_]), and * matches the previous token zero or more times, as many times as possible, giving back as needed (greedy).
unlist(stringr::str_extract_all(data, "\\w*A\\?\\w*"))
#[1] "AlvariA?o" "MagaA?a" "A?vila" "BabiA?"
I made this as a function, but it's rather clunkier than Gki's...
library(quanteda)
library(stringr)
library(magrittr)  # for the pipe

set_of_character <- function(dummy, key){
  n <- nchar(key)
  dummy %>%
    str_split(" ") %>%
    unlist() %>%
    str_replace(",", "") %>%
    sapply(function(x) {
      x %>%
        tokens("character") %>%                # split each word into single characters
        unlist() %>%
        char_ngrams(n, concatenator = "")      # build character n-grams as long as the key
    }) %>%
    sapply(function(x) key %in% x) %>%         # which words contain the key?
    which() %>%
    names()
}
For your example:
set_of_character(data, "A?")
[1] "AlvariA?o" "Garcia-MagaA?a" "A?vila" "BabiA?"
I have the text of a novel in a single vector; it has been split by words (novel.vector.words). I am looking for all instances of the string "blood of". However, since the vector is split by words, each word is its own string, and I don't know how to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a textbook, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delimited KWIC display (key words in context).
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header
#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)) {  # access each match...
  # access the current match
  node <- novel.vector.words[node.positions[i]]
  # access the left context of the current match
  left.context <- novel.vector.words[(node.positions[i] - context):(node.positions[i] - 1)]
  # access the right context of the current match
  right.context <- novel.vector.words[(node.positions[i] + 1):(node.positions[i] + context)]
  # concatenate and print the results
  cat(left.context, "\t", node, "\t", right.context, "\n", file = output.conc, append = TRUE)
}
What I am not sure how to do, however, is use something like an if statement to capture only instances of "blood" followed by "of". Do I need another variable in the for loop? Basically, for every instance of "blood" it finds, I want to check whether the word that immediately follows it is "of". I want the loop to find all of those instances and tell me how many there are in my vector.
You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
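If you just want the count rather than the positions, you can sum the same logical index (a small sketch on the toy vector above):
# Count how often "blood" is immediately followed by "of"
sum(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)), na.rm = TRUE)
[1] 2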
In response to the question in the comments:
This certainly could be done with a loop based approach but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text mining tasks.
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text=readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
unnest_tokens(output = ngram, input = text, token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
nrow
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135
I have a text corpus.
mytextdata <- read.csv("path to texts.csv")
Mystopwords <- read.csv("path to mystopwords.txt")
How can I filter this text? I must delete:
1) all numbers
2) remove the stop words
3) remove the brackets
I will not be working with a dtm; I just need to clean this text data of numbers and stop words
sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
Jura and the are stop words.
In an output I expect
Tablet for cleaning hydraulic system
Since there is only one character string available in the question at the moment, I decided to create sample data myself. I hope this is close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, the contents of the brackets, and the brackets themselves. Then, I split the words in each string using unnest_tokens(). Then, I removed stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added jura in the filter() part. Grouping the data by id, I combined the words to create character strings in summarise(). Note that I used jura instead of Jura because unnest_tokens() converts capital letters to lower case.
mydata <- data.frame(id = 1:2,
text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
"1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
stringsAsFactors = F)
library(dplyr)
library(tidytext)
data(stop_words)
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
unnest_tokens(input = text, output = word) %>%
filter(!word %in% c(stop_words$word, "jura")) %>%
group_by(id) %>%
summarise(text = paste(word, collapse = " "))
# id text
# <int> <chr>
#1 1 tablet cleaning hydraulic system
#2 2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
lapply(function(x){
foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
paste(collapse = " ")
foo}) %>%
unlist
#[1] "Tablet cleaning hydraulic system" "Tablet cleaning mambojumbo system"
There are multiple ways of doing this. If you want to rely on base R only, you can transform @jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this by using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one will remove the stop words. The second regex will have to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:
mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)
custom_filter <- function(string, stopwords = c()){
  string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
  # Create something like: "\\b( the|Jura)\\b"
  new_regex <- paste0("\\b( ", paste0(stopwords, collapse = "|"), ")\\b")
  gsub(new_regex, "", string)
}
stopwords <- c("the", "Jura")
custom_filter(mytextdata[1, 1], stopwords)
# [1] "Tablet for cleaning hydraulic system "
I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1 GB) and has multiple columns that contain white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If I understood you correctly, then you want to remove all the white space from the entire data frame. I guess the code you are using is good for removing spaces in the column names. I think you should try this:
apply(myData, 2, function(x)gsub('\\s+', '',x))
Hope this works.
This will return a matrix, however. If you want to change it back to a data frame, then do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply and the trimws function with which = "both" can remove leading and trailing spaces, but not those inside a string. Since the OP provided no input data, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
val1 = c("klm ", "gdfs", "123"),
num = 1:3,
num1 = 2:4,
stringsAsFactors = FALSE)
#situation: 1 (Using Base R), when we want to remove spaces only at the leading and trailing ends NOT inside the string values, we can use trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# situation: 2 (Using Base R), when we want to remove spaces everywhere in the character columns of the data frame (inside a string as well as at the leading and trailing ends).
(This was the initial solution proposed using apply. Note that an apply-based solution seems to work but would be very slow; also, from the question it is not entirely clear whether the OP wanted to remove only leading/trailing blanks or every blank in the data.)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation1:
    val val1 num num1
1:  abc  klm   1    2
2: kl m gdfs   2    3
3: dfsd  123   3    4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation2:
    val val1 num num1
1:  abc  klm   1    2
2:  klm gdfs   2    3
3: dfsd  123   3    4
Note the difference between the outputs of the two situations. In row 2 you can see that with trimws we only remove the leading and trailing blanks, but with the regex solution we remove every blank.
I hope this helps, thanks.
One possibility involving just dplyr could be:
data %>%
mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
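A rough sketch of that workflow (the column handling here is illustrative, not the original answer's code): remember the original classes, trim everything, then re-apply the numeric types.
original_classes <- sapply(df, class)                          # remember the original types
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE) # everything is character now
num_cols <- names(original_classes)[original_classes %in% c("numeric", "integer")]
df[num_cols] <- lapply(df[num_cols], as.numeric)               # restore the numeric columns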
If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if(class(x)=="character") trimws(x) else(x)), stringsAsFactors=F)
It works for mixed numeric/character data frames and manipulates only the character columns.
You could use the trimws function (available since R 3.2.0) on all the columns.
myData[, 1] <- trimws(myData[, 1])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
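For instance, a small sketch of that loop (assuming myData is a data.frame and you only want to touch the character columns):
for (i in seq_along(myData)) {
  if (is.character(myData[[i]])) {
    myData[[i]] <- trimws(myData[[i]])   # trim leading/trailing whitespace column by column
  }
}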
If you're dealing with large data sets like this, you could really benefit from the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) set(df, j = j, value = trimws(df[[j]]))
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such a file size. However, you have two options:
Use ffdfdply and the ff and ffbase packages:
library(ff)
library(ffbase)
x <- read.csv.ffdf(file = your_file, header = TRUE, VERBOSE = TRUE,
                   first.rows = 1e4, next.rows = 5e4)
splits <- 10  # splits was not defined in the original; choose a chunk count that fits your memory
x$split <- as.ff(rep(seq(splits), each = nrow(x) / splits))
ffdfdply(x, x$split, BATCHBYTES = 0, FUN = function(myData)
  apply(myData, 2, function(x) gsub('\\s+', '', x)))
Use sed (my preference)
sed -r -i "s/(\S)\s+(\S)/\1\2/g;s/^\s+//;s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame - you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes):
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
I think that a simple approach with sapply also works, given a df like:
dat<-data.frame(S=LETTERS[1:10],
M=LETTERS[11:20],
X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
N=c(1:3,'4 ','5 ',6:10),
stringsAsFactors = FALSE)
You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N))
To get rid of the spaces in the numeric column, simply convert it to numeric with as.numeric or as.integer.
dat$N<-as.numeric(dat$N)
If you want to remove all the spaces, do:
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
And again use as.numeric on col N (because sapply will convert it to character)
dat.b$N<-as.numeric(dat.b$N)
I am using R to do some data pre-processing, and here is the problem I am faced with: I read the data in using read.csv(filename, header=TRUE), and the spaces in variable names became ".". For example, a variable named Full Code became Full.Code in the generated data frame. After the processing, I use write.xlsx(filename) to export the results, but the variable names stay changed. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names=FALSE in read.csv when you read the data in, then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (back quotes in some cases) or refer to the columns by location rather than name while editing.
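A tiny illustration of what that looks like in practice (the file name and column position here are hypothetical):
dat <- read.csv("input.csv", header = TRUE, check.names = FALSE)  # keeps "Full Code" as-is
dat$`Full Code`   # refer to the column with backticks...
dat[[2]]          # ...or by position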
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
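For completeness, a minimal sketch of this answer's two steps together, assuming the xlsx package's write.xlsx() (openxlsx spells the argument rowNames instead) and a hypothetical output file name:
library(xlsx)
names(yourdata) <- gsub(x = names(yourdata), pattern = "\\.", replacement = " ")  # dots back to spaces
write.xlsx(yourdata, "output.xlsx", row.names = FALSE)                            # drop the index column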
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
Just to add to the answers already provided, here is another way of replacing the "." or any other kind of punctuation in column names, using a regex with the stringr package, like this:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"