I have words mapped to their synonyms in one data frame. In a different data frame, I have sentences. I want to search the sentences for synonyms from the first data frame and, wherever one is found, replace it with the word the synonym belongs to.
dt = read.table(header = TRUE,
text = "Word Synonyms
Use 'employ, utilize, exhaust, spend, expend, consume, exercise'
Come 'advance, approach, arrive, near, reach'
Go 'depart, disappear, fade, move, proceed, recede, travel'
Run 'dash, escape, elope, flee, hasten, hurry, race, rush, speed, sprint'
Hurry 'rush, run, speed, race, hasten, urge, accelerate, bustle'
Hide 'conceal, cover, mask, cloak, camouflage, screen, shroud, veil'
", stringsAsFactors = FALSE)
mydf = read.table(header = TRUE, stringsAsFactors = FALSE,
text ="sentence
'I can utilize this file'
'I can cover these things'
")
The desired output looks like -
I can Use this file
I can Hide these things
Above is just a sample. In my real dataset, I have more than 10000 sentences.
One can replace ", " in dt$Synonyms with "|" so that each entry can be used as the pattern argument of gsub. Then, row by row, replace any occurrence of a synonym (the alternatives separated by |) with dt$Word. One can use sapply and gsub as:
Edited: Added word-boundary check (as part of pattern in gsub) as suggested by OP.
# First replace `, ` with `|` in dt$Synonyms. Now dt$Synonyms can be
# used as the 'pattern' argument of `gsub`.
dt$Synonyms <- paste("\\b", gsub(", ", "\\\\b|\\\\b", dt$Synonyms), "\\b", sep = "")
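# For instance, dt$Synonyms[1] should now print as (given the sample data):
# "\\bemploy\\b|\\butilize\\b|\\bexhaust\\b|..."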
# Loop through each row of 'dt' to replace Synonyms with word using sapply
mydf$sentence <- sapply(mydf$sentence, function(x){
  for(row in 1:nrow(dt)){
    x = gsub(dt$Synonyms[row], dt$Word[row], x)
  }
  x
})
mydf
# sentence
# 1 I can Use this file
# 2 I can Hide these things
Here is a tidyverse solution...
library(stringr)
library(dplyr)
library(tidyr) # for unnest()
dt2 <- dt %>%
  mutate(Synonyms = str_split(Synonyms, ",\\s*")) %>% # split into words
  unnest(Synonyms) # this results in a long dataframe of words and synonyms

mydf2 <- mydf %>%
  mutate(Synonyms = str_split(sentence, "\\s+")) %>% # split into words
  unnest(Synonyms) %>% # expand to long form, one word per row
  left_join(dt2) %>% # match synonyms
  mutate(Word = ifelse(is.na(Word), Synonyms, Word)) %>% # keep unmatched words the same
  group_by(sentence) %>%
  summarise(sentence2 = paste(Word, collapse = " ")) # reconstruct sentences
mydf2
# A tibble: 2 x 2
  sentence                 sentence2
  <chr>                    <chr>
1 I can cover these things I can Hide these things
2 I can utilize this file  I can Use this file
Related
I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character list and I am trying to make it so that the content within my data (VCorpus) comprises only the words in my dictionary.
For example:
#[1] "never give up uouo cbbuk jeez"
would become
#[1] "never give up"
as the words "never","give",and "up" are all in the custom dictionary.
I have previously tried the following:
#Reading the custom dictionary as a function
english.words <- function(x) x %in% custom.dictionary
#Filtering based on words in the dictionary
DF2 <- DF1[(english.words(DF1$Text)),]
but my result is a character list with one word. Any advice?
You can split the sentences into words, keep only the words that are part of your dictionary, and paste them back into one sentence.
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
  paste0(Filter(english.words, x), collapse = ' '))
Here I have created a new column called Text1 with only the English words; if you want to replace the original column, you can save the output in DF1$Text instead.
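For a quick self-contained check, here is a minimal sketch where custom.dictionary and DF1 are stand-ins built from the example in the question:
custom.dictionary <- c("never", "give", "up")
english.words <- function(x) x %in% custom.dictionary
DF1 <- data.frame(Text = "never give up uouo cbbuk jeez",
                  stringsAsFactors = FALSE)
# Split each sentence on whitespace, keep dictionary words, re-join
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
  paste0(Filter(english.words, x), collapse = ' '))
DF1$Text1
# [1] "never give up"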
Since you use a data frame, you could try this:
library(tidyverse)
library(tidytext)
dat <- tibble(text = "never give up uouo cbbuk jeez")
words_to_keep <- c("never", "give", "up")

keep_function <- function(data, words_to_keep){
  data %>%
    unnest_tokens(word, text) %>%
    filter(word %in% words_to_keep) %>%
    nest(text = word) %>%
    mutate(text = map(text, unlist),
           text = map_chr(text, paste, collapse = " "))
}

keep_function(dat, words_to_keep)
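For reference, running keep_function(dat, words_to_keep) on the sample tibble above should return a one-row tibble along these lines (exact print formatting may differ by package version):
# # A tibble: 1 x 1
#   text
#   <chr>
# 1 never give up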
I have the following code in tidyverse, and a list of words in words.xlsx like:
hello
world
program
data
analysis
v1 = read_excel('words.xlsx') %>%
  mutate(words = tolower(words)) %>%
  pull(1)

for(v in v1){
  data1 = data1 %>%
    mutate(!!v := as.integer(heading %like% v))
}
I want to edit this code so that, instead of an integer, I get the actual words that were found in each string, separated with a comma.
You can paste all the words in v1 with word boundaries and use str_extract_all to extract any word from v1 present in data1$heading. str_extract_all returns a list of words, so we can use sapply to turn each list element into one concatenated string.
sapply(stringr::str_extract_all(data1$heading,
                                paste0('\\b', v1, '\\b', collapse = '|')),
       function(x) toString(unique(x)))
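An illustration with made-up data (v1 and data1 below are hypothetical stand-ins, not the OP's actual data):
v1 <- c("hello", "world", "data")
data1 <- data.frame(heading = c("hello world program", "data analysis run"),
                    stringsAsFactors = FALSE)
sapply(stringr::str_extract_all(data1$heading,
                                paste0('\\b', v1, '\\b', collapse = '|')),
       function(x) toString(unique(x)))
# [1] "hello, world" "data"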
One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
The rest of the data in the vector follows the same pattern. I want to separate each component of the string into a separate data element and put it in new columns, like df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\n\\((.*)\\)', '\\1', df$location1)
I was able to make it work by looking at other code online, but I don't fully understand it. I understand the regex pattern but not the "\\1" part. Since I don't understand it in full, I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting using regex, using the scan() function, and too many other variations. Now I am all confused. Thx
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
                  stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
dat2 <- dat %>%
  separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
           sep = ", |\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413
You can try strsplit or data.table::tstrsplit(strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
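A quick check with the example string (df here is a minimal one-row stand-in):
library(data.table)
df <- data.frame(location1 = "Potomac, MD 20854\n(39.038266, -77.203413)",
                 stringsAsFactors = FALSE)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
df[-1]
#      city state       lat       long   zip
# 1 Potomac    MD 39.038266 -77.203413 20854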
Here is an option using base R
read.table(text = trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
           header = FALSE,
           col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
           stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
So this process might be a little longer, but for me it makes things clearer. As opposed to splitting on separators, below I identify each value with a regex specific to it. I make a vector of regexes to extract the values and a vector of variable names, then iterate over both to extract the values and build the dataframe.
library(stringi)
library(dplyr)
library(purrr)
rgexVec <- c("[\\w\\s-]+(?=,)",
             "[A-Z]{2}",
             "\\d+(?=\\n)",
             "[\\d-\\.]+(?=,)",
             "[\\d-\\.]+(?=\\))")

varNames <- c("city",
              "state",
              "zip",
              "lat",
              "long")

value <- dat$Data  # the string(s) to extract from (dat as defined above)

map2_dfc(varNames, rgexVec, function(vn, rg) {
  extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
  names(extractedVal) <- vn
  extractedVal %>% as_tibble()
})
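With the single example string from dat, this should produce a 1 x 5 tibble along the lines of:
# # A tibble: 1 x 5
#   city    state zip   lat       long
#   <chr>   <chr> <chr> <chr>     <chr>
# 1 Potomac MD    20854 39.038266 -77.203413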
\\1 is a backreference in regex. In the replacement string it stands for whatever the first capturing group (the part of the pattern inside the first pair of parentheses) matched. So the gsub above matches the whole string but writes back only the captured lat/long part, discarding the rest.
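A small illustration, reusing the example string from the question:
gsub('.*\\n\\((.*)\\)', '\\1', 'Potomac, MD 20854\n(39.038266, -77.203413)')
# [1] "39.038266, -77.203413"
# The pattern matches the whole string; '\\1' writes back only what the
# group (.*) captured between '\\(' and '\\)'.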
I have a text corpus.
mytextdata = read.csv(path to texts.csv)
Mystopwords=read.csv(path to mystopwords.txt)
How can I filter this text? I must delete:
1) all numbers
2) the stop words
3) the brackets and their contents
I will not work with a dtm; I just need to clean this text data of numbers and stopwords
sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
Jura and the are the stopwords here.
In an output I expect
Tablet for cleaning hydraulic system
Since there is only one character string available in the question at the moment, I decided to create sample data myself. I hope this is something close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, contents in the brackets, and the brackets themselves. Then, I split the words in each string using unnest_tokens(). Then, I removed stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added jura in the filter() part. Grouping the data by id, I combined the words in order to reconstruct character strings in summarise(). Note that I used jura instead of Jura, because unnest_tokens() converts capital letters to lower case.
mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)
library(dplyr)
library(tidytext)
data(stop_words)
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% c(stop_words$word, "jura")) %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))
# id text
# <int> <chr>
#1 1 tablet cleaning hydraulic system
#2 2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
  stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
  lapply(function(x){
    foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
      paste(collapse = " ")
    foo
  }) %>%
  unlist
#[1] "Tablet cleaning hydraulic system" "Tablet cleaning mambojumbo system"
There are multiple ways of doing this. If you want to rely on base R only, you can transform #jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this by using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one will remove the stop words. The second regex will have to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:
mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)
custom_filter <- function(string, stopwords = c()){
  string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
  # Create something like: "\\b( the|Jura)\\b"
  new_regex <- paste0("\\b( ", paste0(stopwords, collapse = "|"), ")\\b")
  gsub(new_regex, "", string)
}
stopwords <- c("the", "Jura")
custom_filter(mytextdata[, 1], stopwords)
# [1] "Tablet for cleaning hydraulic system "
I have a string as follows:
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I want to eliminate all duplicated addresses, so my expected result is:
expected <- "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I tried (^[\w|.|:|\/]*),\1+ on regex101.com, and it removes the first repetition of the string (but fails at the second). However, if I port it to R's gsub it doesn't work as expected:
gsub("(^[\\w|.|:|\\/]*),\\1+", "\\1", text)
I've tried with perl = FALSE and TRUE to no avail.
What am I doing wrong?
If they are sequential, you just need to modify your regex slightly.
Take out your BOS anchor ^.
Add a cluster group around the comma and backreference, then quantify it (?:,\1)+.
And lose the pipe symbols |, since inside a character class they are just literals.
([\w.:/]+)(?:,\1)+
https://regex101.com/r/FDzop9/1
( [\w.:/]+ ) # (1), The address
(?: # Cluster
, \1 # Comma followed by what found in group 1
)+ # Cluster end, 1 to many times
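In R, the same pattern can be applied with gsub and perl = TRUE (a minimal sketch, using the text variable from the question):
gsub("([\\w.:/]+)(?:,\\1)+", "\\1", text, perl = TRUE)
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"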
Note - if you use split and unique then combine, duplicates are removed even when they are not adjacent, and only the first occurrence of each item is kept, whereas the regex only collapses consecutive repeats.
An alternative approach is to split the string on the comma, then unique the results, then re-combine for your single text
paste0(unique(strsplit(text, ",")[[1]]), collapse = ",")
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
text <- c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
          "http://q.co/imag/qrs.png,http://q.co/imag/qrs.png")
df <- data.frame(no = 1:2, text)
You can use functions from tidyverse if your strings are in a dataframe:
library(tidyverse)
separate_rows(df, text, sep = ",") %>%
  distinct() %>%
  group_by(no) %>%
  mutate(text = paste(text, collapse = ",")) %>%
  slice(1)
The output is:
# no text
# <int> <chr>
# 1 1 http://x.co/imag/xyz.png,http://x.co/imag/jpg.png
# 2 2 http://q.co/imag/qrs.png