How to join sentences that were split by invalid line breaks - r

I have a long character string containing several sentences in a row.
For example:
"I have a
apple.
but I like banana."
The lines are broken irregularly like this. Is there any way to automatically join them back together?
Desired result:
"I have a apple.
but I like banana."

I added a few more lines to the input for testing purposes.
x <- "I have a
apple.
but I like banana.
This is new text.
and another one
to complete it."
#Split the string on newlines
tmp <- trimws(strsplit(x, '\n')[[1]])
#Create a grouping variable that increments every time a line
#ends with ".", then paste each group together.
tapply(tmp, c(0, head(cumsum(grepl('\\.$', tmp)), -1)), function(x) paste0(x, collapse = ' ')) |>
  #Collapse the pieces into one string
  paste0(collapse = '\n') |>
  #For printing purposes
  cat()
#I have a apple.
#but I like banana.
#This is new text.
#and another one to complete it.
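A compact alternative is a single gsub() with a perl lookbehind: replace every newline that is not preceded by a "." with a space. This is a minimal sketch assuming sentences always end with a period and the continuation lines carry no extra leading whitespace (otherwise trimws() the pieces first):
#Replace newlines that do NOT follow a "." with a space
cat(gsub("(?<!\\.)\n", " ", x, perl = TRUE))
#I have a apple.
#but I like banana.
#This is new text.
#and another one to complete it.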

Related

How can I replace emojis with text and treat them as single words?

I have to do topic modeling based on pieces of text containing emojis with R. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.
Terms like "heart" can have very different meanings, as can be seen with "red heart ufef" and "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make an analysis hard.
Dummy data set, reproducible by using dput() (including the step that forces lowercase):
Emoji_struct <- c(
  list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
  list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current coding (data_orig is a list of several files):
library(textclean)
library(tm) #stripWhitespace() below comes from tm; the rest is base R
#pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lowercase output would work, too.
Best regards. Stay safe. Stay healthy.
Answer
Replace the default conversion table in replace_emoji() with a version in which the spaces and punctuation are removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
#Emoji_struct as defined above is a plain list, so apply over its elements
lapply(Emoji_struct, replace_emoji, emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
list("list_element_1: 🔥", "list_element_2: ❌") %>%
lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
# x y
#1: <e2><86><95> up-down arrow
#2: <e2><86><99> down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a> watch
#6: <e2><8c><9b> hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table to remove all spaces and punctuation. Note that this also allows you to add new byte representations (the escaped UTF-8 bytes in the x column) with an accompanying replacement string.
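For example, the stray "ufef" in the question comes from the variation selector U+FE0F, which has no entry of its own. An untested sketch of mapping its UTF-8 bytes to an empty string so it disappears from the output:
#Untested sketch: drop the variation selector (U+FE0F, bytes ef b8 8f)
#by giving it an empty replacement in the conversion table
hash2 <- rbind(hash2,
               data.table::data.table(x = "<ef><b8><8f>", y = ""))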

Is there an R function to clean text via a custom dictionary?

I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character vector and I am trying to make the content of my data (a VCorpus) comprise only the words in my dictionary.
For example:
#[1] "never give up uouo cbbuk jeez"
would become
#[1] "never give up"
as the words "never", "give", and "up" are all in the custom dictionary.
I have previously tried the following:
#Predicate that checks which elements are in the custom dictionary
english.words <- function(x) x %in% custom.dictionary
#Filtering based on words in the dictionary
DF2 <- DF1[(english.words(DF1$Text)),]
but my result is a character list with one word. Any advice?
You can split the sentences into words, keep only the words that are part of your dictionary, and paste them back together into one sentence.
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
  paste0(Filter(english.words, x), collapse = ' '))
Here I have created a new column called Text1 with only the English words; if you want to replace the original column, you can assign the output to DF1$Text instead.
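A minimal self-contained run on the question's example (custom.dictionary here is a tiny stand-in for the real 400,000-word vector):
custom.dictionary <- c("never", "give", "up")
english.words <- function(x) x %in% custom.dictionary
DF1 <- data.frame(Text = "never give up uouo cbbuk jeez",
                  stringsAsFactors = FALSE)
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
  paste0(Filter(english.words, x), collapse = ' '))
DF1$Text1
#[1] "never give up"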
Since you use a data frame you could try this:
library(tidyverse)
library(tidytext)

dat <- tibble(text = "never give up uouo cbbuk jeez")
words_to_keep <- c("never", "give", "up")

keep_function <- function(data, words_to_keep){
  data %>%
    unnest_tokens(word, text) %>%
    filter(word %in% words_to_keep) %>%
    nest(text = word) %>%
    mutate(text = map(text, unlist),
           text = map_chr(text, paste, collapse = " "))
}
keep_function(dat, words_to_keep)
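Assuming the packages above load cleanly, this call should return a one-row tibble whose text column reads "never give up".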

Extracting strings from a PDF with R

I have this PDF file from the European Parliament, which you can download here.
I downloaded it and read it into R.
It contains lists of names of Members of European Parliament (MEP) after a session of vote.
I want to extract just bits of these lists. Specifically, I want to extract and put in a table the names situated between "AVGIVNA RÖSTER" and 0, see the text highlighted in this screenshot.
Similar series of names repeat throughout the PDF, each referring to a specific vote. I want them all in a table. The MEPs' names change but the structure remains: they are always situated between "AVGIVNA RÖSTER" and "0".
I thought of using a startsWith() function and a for loop, but I struggle with writing it.
Here is what I did so far:
library(pdftools)
library(tidyverse)
votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()
You could try something like this:
votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()

a <- which(grepl("AVGIVNA RÖSTER", votetext)) #beginning of string
b <- which(grepl("^\\s*0\\s*$", votetext))    #end of string

sapply(a, function(x){paste(votetext[x:(min(b[b > x]))], collapse = ". ")})
Note that in the definition of b I use \\s* to allow for surrounding white space in the line that contains only "0".
In general you could first remove trailing and leading white space, see this question.
In your case you could do:
votetext2 <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines() %>%
  str_remove("^\\s*") %>%        #remove white space at the beginning
  str_remove("\\s*$") %>%        #remove white space at the end
  str_replace_all("\\s+", " ")   #replace multiple white spaces with a single white space
a2 <- which(votetext2 == "AVGIVNA RÖSTER")
b2 <- which(votetext2 == "0")
result <- sapply(a2, function(x){paste(votetext2[x:(min(b2[b2 > x]))], collapse = ". ")})
result then looks like this:
`"AVGIVNA RÖSTER. Martin Hojsík, Naomi Long, Margarida Marques, Pedro Marques, Manu Pineda, Ramona Strugariu, Marie Toussaint,. + Dragoş Tudorache, Marie-Pierre Vedrenne. -. Agnès Evren. 0"

Filtering text from numbers and stopwords in R (not for tdm)

I have a text corpus.
mytextdata <- read.csv("path/to/texts.csv")
Mystopwords <- read.csv("path/to/mystopwords.txt")
How can I filter this text? I must:
1) delete all numbers
2) remove the stop words
3) remove the brackets
I will not work with a dtm; I just need to clean this text data of numbers and stop words.
sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
"Jura" and "the" are stop words.
As output I expect:
Tablet for cleaning hydraulic system
Since there is only one character string available in the question at the moment, I decided to create sample data myself. I hope this is something close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, contents in the brackets, and the brackets themselves. Then, I split the words in each string using unnest_tokens(). Then, I removed stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added jura in the filter() part. Grouping the data by id, I combined the words to create character strings in summarise(). Note that I used jura instead of Jura, because unnest_tokens() converts capital letters to lower case.
mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)

library(dplyr)
library(tidytext)
data(stop_words)

mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% c(stop_words$word, "jura")) %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))
# id text
# <int> <chr>
#1 1 tablet cleaning hydraulic system
#2 2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)

gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
  stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
  lapply(function(x){
    foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
      paste(collapse = " ")
    foo}) %>%
  unlist
#[1] "Tablet cleaning hydraulic system" "Tablet cleaning mambojumbo system"
There are multiple ways of doing this. If you want to rely on base R only, you can transform #jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one removes the stop words. The second regex has to be constructed from the stop words you want to remove. If we put it all in a function, you can apply it to your whole character vector at once, since gsub() is vectorized:
mytextdata <- read.csv("123.csv", header = FALSE, stringsAsFactors = FALSE)

custom_filter <- function(string, stopwords = c()){
  string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
  # Create something like: "\\b( the|Jura)\\b"
  new_regex <- paste0("\\b( ", paste0(stopwords, collapse = "|"), ")\\b")
  gsub(new_regex, "", string)
}

stopwords <- c("the", "Jura")
custom_filter(mytextdata[[1]], stopwords) #[[ ]] extracts the column as a character vector
# [1] "Tablet for cleaning hydraulic system "

Avoid spaces in column names being replaced with periods (".") when using read.csv()

I am using R for some data pre-processing, and here is the problem I am faced with: I read the data using read.csv(filename, header=TRUE), and the spaces in variable names became ".". For example, a variable named Full Code became Full.Code in the generated data frame. After the processing, I use write.xlsx(filename) to export the results, but the variable names have been changed. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names=FALSE in read.csv when you read the data in, the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (back-quotes in some cases) or refer to the columns by location rather than name while editing.
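For example (a sketch; "yourfile.csv" is a placeholder for your actual file):
df <- read.csv("yourfile.csv", header = TRUE, check.names = FALSE)
df$`Full Code`   #names with spaces need back-quotes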
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces.
# This might have unintended consequences, so be sure to check the results.
names(yourdata) <- gsub(x = names(yourdata),
                        pattern = "\\.",
                        replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx() call. That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
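For instance, assuming write.xlsx() comes from the xlsx package (openxlsx spells the argument rowNames instead):
write.xlsx(yourdata, "output.xlsx", row.names = FALSE)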
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
  # FIXME: Repetitive.

  # Convert any number of consecutive dots to a single space.
  names(ds) <- gsub(x = names(ds),
                    pattern = "(\\.)+",
                    replacement = " ")

  # Drop the trailing spaces.
  names(ds) <- gsub(x = names(ds),
                    pattern = "( )+$",
                    replacement = "")
  ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
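A compact equivalent sketch, if you prefer a one-liner (note that trimws() also trims leading spaces, which the function above leaves alone):
#Collapse runs of dots to a single space, then trim the edges
names(ds) <- trimws(gsub("\\.+", " ", names(ds)))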
Just to add to the answers already provided, here is another way of replacing the "." (or any other kind of punctuation) in column names, using a regex with the stringr package:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"
