I have the following tidyverse code, and a list of words in words.xlsx like:
hello
world
program
data
analysis
library(readxl)
library(dplyr)
library(data.table)   # assuming %like% comes from data.table

v1 = read_excel('words.xlsx') %>%
  mutate(words = tolower(words)) %>%
  pull(1)

for(v in v1){
  data1 = data1 %>%
    mutate(!!v := as.integer(heading %like% v))
}
I want to edit this code so that, instead of an integer, I get the actual words that were found in each string (separated by commas), like in the image. For example, if heading is "hello world analysis", the new column should contain "hello, world, analysis" rather than 1.
You can paste all the words in v1 together with word boundaries and use str_extract_all to extract any word in v1 present in data1$heading. str_extract_all returns a list of matched words; we can use sapply to collapse each element into one comma-separated string.
sapply(stringr::str_extract_all(data1$heading,
                                paste0('\\b', v1, '\\b', collapse = '|')),
       function(x) toString(unique(x)))
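A minimal sketch of how this can replace the loop above, assuming data1 has a heading column (the words_found column name is just illustrative):
library(stringr)

# one alternation pattern covering every word in v1
pattern <- paste0('\\b', v1, '\\b', collapse = '|')

# comma-separated string of the unique words found in each heading
data1$words_found <- sapply(str_extract_all(data1$heading, pattern),
                            function(x) toString(unique(x)))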
I have a series of strings that have a particular set of characters. What I'd like to do is be able to extract just the word from the string with those characters in it, and discard the rest.
I've tried various regex expressions to do it but I either get it to split all the words or it returns the entire string. Following is an example of the kinds of strings. I've been trying to use stringr::str_extract_all() as there are instances where there are more than one word that needs to be pulled out.
data <- c("AlvariA?o, 1961","Andrade-Salas, Pineda-Lopez & Garcia-MagaA?a, 1994", "A?vila & Cordeiro, 2015", "BabiA?, 1922")
result <- unlist(stringr::str_extract_all(data, "regex"))
From this I'd like a result that pulls all the words that has the "A?", like this:
result <- c("AlvariA?o", "MagaA?a", "A?vila", "BabiA?")
It seems really simple but my regex knowledge is just not cutting it at the moment.
To match a literal ? it needs to be escaped with \\?, so A\\? will match A?. \\w matches any word character (equivalent to [a-zA-Z0-9_]), and * matches the previous token zero or more times, as many times as possible, giving back as needed (greedy).
unlist(stringr::str_extract_all(data, "\\w*A\\?\\w*"))
#[1] "AlvariA?o" "MagaA?a" "A?vila" "BabiA?"
I made it into a function, but it's quite a bit worse than Gki's...
library(quanteda)
library(stringr)
library(magrittr)   # for %>%

set_of_character <- function(dummy, key){
  n <- nchar(key)
  dummy %>%
    str_split(" ") %>%                      # split each string into words
    unlist %>%
    str_replace(",", "") %>%                # drop commas
    sapply(function(x) {
      x %>%
        tokens("character") %>%             # split each word into characters
        unlist() %>%
        char_ngrams(n, concatenator = "")   # all character n-grams of length nchar(key)
    }) %>%
    sapply(function(x) key %in% x) %>%      # which words contain the key as an n-gram?
    which %>%
    names %>%
    return
}
For your example:
set_of_character(data, "A?")
[1] "AlvariA?o" "Garcia-MagaA?a" "A?vila" "BabiA?"
I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character list, and I am trying to make the content of my data (a VCorpus) comprise only the words in my dictionary.
For example:
#[1] "never give up uouo cbbuk jeez"
would become
#[1] "never give up"
as the words "never", "give", and "up" are all in the custom dictionary.
I have previously tried the following:
# Function that tests membership in the custom dictionary
english.words <- function(x) x %in% custom.dictionary
#Filtering based on words in the dictionary
DF2 <- DF1[(english.words(DF1$Text)),]
but my result is a character list with one word. Any advice?
You can split the sentences into words, keep only words that are part of your dictionary and paste them in one sentence again.
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
paste0(Filter(english.words, x), collapse = ' '))
Here I have created a new column called Text1 with only English words; if you want to replace the original column, you can save the output in DF1$Text instead.
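To see it work end to end, here is a minimal self-contained sketch, using the example sentence from the question as DF1 and a three-word dictionary as custom.dictionary (both made up for illustration):
# Illustrative sample data and dictionary
DF1 <- data.frame(Text = "never give up uouo cbbuk jeez", stringsAsFactors = FALSE)
custom.dictionary <- c("never", "give", "up")

english.words <- function(x) x %in% custom.dictionary

# Split each sentence, keep dictionary words, paste back together
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
  paste0(Filter(english.words, x), collapse = ' '))

DF1$Text1
# [1] "never give up"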
Since you use a data frame, you could try this:
library(tidyverse)
library(tidytext)
dat <- tibble(text = "never give up uouo cbbuk jeez")
words_to_keep <- c("never", "give", "up")

keep_function <- function(data, words_to_keep){
  data %>%
    unnest_tokens(word, text) %>%
    filter(word %in% words_to_keep) %>%
    nest(text = word) %>%
    mutate(text = map(text, unlist),
           text = map_chr(text, paste, collapse = " "))
}

keep_function(dat, words_to_keep)
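This should return a one-row tibble whose text column reads "never give up", matching the desired output above.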
I want to extract everything but a pattern and return the result concatenated in a string.
I tried to combine str_extract_all with sapply and cat:
library(tidyverse)

x = c("a_1","a_20","a_40","a_30","a_28")
data <- tibble(age = x)

# extracting just the first pattern is easy
data %>%
  mutate(age_new = str_extract(age, "[^a_]"))

# combining str_extract_all and sapply doesn't work
data %>%
  mutate(age_new = sapply(str_extract_all(x, "[^a_]"), function(x) cat(x, sep = "")))

class(str_extract_all(x, "[^a_]"))
sapply(str_extract_all(x, "[^a_]"), function(x) cat(x, sep = ""))
This returns NULL instead of the concatenated patterns, because cat only prints to the console and returns NULL.
Instead of cat, we can use paste. Also, with tidyverse, we can make use of map and str_c (in place of paste, from stringr):
library(tidyverse)
data %>%
mutate(age_new = map_chr(str_extract_all(x, "[^a_]+"), ~ str_c(.x, collapse="")))
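With the sample data this should give age_new values of "1", "20", "40", "30" and "28".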
Using the OP's code:
data %>%
  mutate(age_new = sapply(str_extract_all(x, "[^a_]"),
                          function(x) paste(x, collapse = "")))
If the intention is to get the numbers:
library(readr)
data %>%
  mutate(age_new = parse_number(x))
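parse_number() simply strips the non-numeric characters, so here age_new becomes the numbers 1, 20, 40, 30 and 28 directly.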
Here is a non-tidyverse solution, just using stringr.
apply(str_extract_all(column, regex_command, simplify = TRUE), 1, paste, collapse = "")
Setting simplify = TRUE makes str_extract_all output a matrix, and apply iterates over its rows. I got the idea from https://stackoverflow.com/a/4213674/8427463
Example: extract all 'r's in rownames(mtcars) and concatenate them as a vector:
library(stringr)
apply(str_extract_all(rownames(mtcars), "r", simplify = TRUE), 1, paste, collapse = "")
I have words matched against their synonyms. In a different data frame, I have sentences. I want to search the sentences for synonyms from the other data frame and, if one is found, replace it with the word the synonym belongs to.
dt = read.table(header = TRUE,
text ="Word Synonyms
Use 'employ, utilize, exhaust, spend, expend, consume, exercise'
Come 'advance, approach, arrive, near, reach'
Go 'depart, disappear, fade, move, proceed, recede, travel'
Run 'dash, escape, elope, flee, hasten, hurry, race, rush, speed, sprint'
Hurry 'rush, run, speed, race, hasten, urge, accelerate, bustle'
Hide 'conceal, cover, mask, cloak, camouflage, screen, shroud, veil'
", stringsAsFactors= F)
mydf = read.table(header = TRUE, stringsAsFactors = F,
text ="sentence
'I can utilize this file'
'I can cover these things'
")
The desired output looks like:
I can Use this file
I can Hide these things
Above is just a sample. In my real dataset, I have more than 10000 sentences.
One can replace ", " in dt$Synonyms with | so that it can be used as the pattern argument of gsub. Now, use dt$Synonyms as the pattern and replace any occurrence of a synonym (separated by |) with dt$Word. One can use sapply and gsub as follows:
Edited: Added word-boundary check (as part of pattern in gsub) as suggested by OP.
# First replace ", " with "|" in dt$Synonyms. Now dt$Synonyms can be
# used as the 'pattern' argument of gsub.
dt$Synonyms <- paste("\\b",gsub(", ","\\\\b|\\\\b",dt$Synonyms),"\\b", sep = "")
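# For illustration, dt$Synonyms[1] now looks like (truncated):
# "\\bemploy\\b|\\butilize\\b|\\bexhaust\\b|\\bspend\\b|..."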
# Loop through each row of 'dt', replacing synonyms with the word, using sapply
mydf$sentence <- sapply(mydf$sentence, function(x){
  for(row in 1:nrow(dt)){
    x = gsub(dt$Synonyms[row], dt$Word[row], x)
  }
  x
})
mydf
# sentence
# 1 I can Use this file
# 2 I can Hide these things
Here is a tidyverse solution...
library(stringr)
library(dplyr)
library(tidyr)   # for unnest()
dt2 <- dt %>%
  mutate(Synonyms = str_split(Synonyms, ",\\s*")) %>% # split into words
  unnest(Synonyms) # this results in a long dataframe of words and synonyms
mydf2 <- mydf %>%
  mutate(Synonyms = str_split(sentence, "\\s+")) %>% # split into words
  unnest(Synonyms) %>% # expand to long form, one word per row
  left_join(dt2) %>% # match synonyms
  mutate(Word = ifelse(is.na(Word), Synonyms, Word)) %>% # keep unmatched words the same
  group_by(sentence) %>%
  summarise(sentence2 = paste(Word, collapse = " ")) # reconstruct sentences
mydf2
sentence sentence2
<chr> <chr>
1 I can cover these things I can Hide these things
2 I can utilize this file I can Use this file
I have a text corpus.
mytextdata = read.csv("path to texts.csv")
Mystopwords = read.csv("path to mystopwords.txt")
How can I filter this text? I must:
1) delete all numbers
2) filter out the stop words
3) remove the brackets
I will not work with a dtm; I just need to clean this text data of numbers and stop words.
sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
"Jura" and "the" are stop words.
In an output I expect
Tablet for cleaning hydraulic system
Since there is only one character string available in the question at the moment, I decided to create some sample data myself; I hope it is close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, the contents of the brackets, and the brackets themselves. Then, I split the words in each string using unnest_tokens() and removed stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added "jura" in the filter() part. Grouping the data by id, I combined the words in summarise() to recreate character strings. Note that I used "jura" instead of "Jura" because unnest_tokens() converts capital letters to lower case.
mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = F)
library(dplyr)
library(tidytext)
data(stop_words)
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% c(stop_words$word, "jura")) %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))
# id text
# <int> <chr>
#1 1 tablet cleaning hydraulic system
#2 2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
  stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
  lapply(function(x){
    x[which(!x %in% c(stop_words$word, "Jura"))] %>%
      paste(collapse = " ")
  }) %>%
  unlist
#[1] "Tablet cleaning hydraulic system" "Tablet cleaning mambojumbo system"
There are multiple ways of doing this. If you want to rely on base R only, you can adapt @jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this using two regular expressions: the first one matches the contents of the brackets and the numeric values, whereas the second one removes the stop words. The second regex has to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:
mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)
custom_filter <- function(string, stopwords = c()){
  string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
  # Create something like: "\\b( the|Jura)\\b"
  new_regex <- paste0("\\b( ", paste0(stopwords, collapse = "|"), ")\\b")
  gsub(new_regex, "", string)
}
stopwords <- c("the", "Jura")
custom_filter(mytextdata[1], stopwords)
# [1] "Tablet for cleaning hydraulic system "