This could be a very basic question, but honestly, I tried a few solutions from similar questions and was unable to get them working on my data. It could be because of my data, or maybe I'm having a hard day and couldn't figure anything out. :(
I have a vector of sentences
vec = c("having many items", "have an apple", "item")
Also, I have a data frame to lemmatize the data
lem = data.frame(pattern = c("(items)|(item)", "(has)|(have)|(having)|(had)"), replacement = c("item", "have"))
lem$pattern = as.character(lem$pattern)
lem$replacement = as.character(lem$replacement)
I want to go through each row in the lem data frame to form a replacement command.
Option 1:
library(stringr) #this is said to be quicker than gsub and my data has 3 mil sentences
vec <- sapply(lem, function(x) str_replace_all(vec, pattern=x$pattern, replacement = x$replacement))
Error in x$pattern : $ operator is invalid for atomic vectors
Option 2:
library(doParallel)  # loads foreach and provides the %dopar% backend
vec <- foreach(i = 1:nrow(lem)) %dopar% {
str_replace_all(vec, pattern = lem[i, "pattern"], replacement = lem[i, "replacement"])
}
Option 2 returns a list of 2 vectors: the first one is what I want, the second one is the original, and I don't know why. Also, from testing on my machine, doParallel (though it uses parallel programming) is not as fast as sapply.
Since my data is quite big (3 mil sentences), could somebody recommend an effective method to lemmatize the text data?
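For context on the Option 1 error: sapply(lem, ...) iterates over the columns of lem, and each column is an atomic character vector, so x$pattern fails. A minimal row-wise sketch that keeps the data-frame layout (and feeds the updated vector back in on each pass) would be:
library(stringr)
# apply each pattern/replacement row in turn, reusing the updated vector
for (i in seq_len(nrow(lem))) {
  vec <- str_replace_all(vec, lem$pattern[i], lem$replacement[i])
}
vec
# expected: "have many item" "have an apple"  "item"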
Another option is to create a named vector from your pattern and replacement vectors instead of a data frame, and then use str_replace_all directly, like this:
library(stringr)
vec <- c("having many items", "has an apple", "items")
lem <- c("item", "have")
names(lem) <- c("(items)|(item)", "(has)|(have)|(having)|(had)")
str_replace_all(vec, lem)
## "have many item" "have an apple" "item"
You could use the stri_replace_all_regex from the stringi library, which will perform your replacements sequentially:
library(stringi)
stri_replace_all_regex(vec, lem$pattern, lem$replacement, vectorize_all = FALSE)
[1] "have many item" "have an apple" "item"
I'm trying to run text analysis on a list of 2000+ rows of keywords, but they are listed like
"Strategy;Management Styles;Organizations"
So when I use tm to remove punctuation it becomes
"StrategyManagement StylesOrganizations"
and I assume this somehow breaks my frequently-used-terms analysis.
I've tried using
vector<-gsub(';', " ",vector)
but this takes my vector data "List of 2000" and turns it into a value with the description "Large character (3 elements)". When I inspected this value it gave me a really long list of words and stuff which took forever to load! Any ideas what I'm doing wrong?
Should I use gsub on my vector or on my corpus? They are just
vector<-VectorSource(dataset$Keywords)
corpus<-VCorpus(vector)
I tried using
inspect(corpus[[1]])
on my corpus after using gsub to make it a value, but I got the error "no applicable method for 'inspect' applied to an object of class "character"".
You need to split the data into a vector of strings; one way to do this is with the stringr package, as follows:
library(tm)
library(stringr)
vector <- c("Strategy;Management Styles;Organizations")
keywords <- unlist(stringr::str_split(vector, ";"))
vector <- VectorSource(keywords)
corpus <- VCorpus(vector)
inspect(corpus[[1]])
#<<PlainTextDocument>>
# Metadata: 7
#Content: chars: 8
#Strategy
Maybe you can try strsplit
X <- c("Global Mindset;Management","Auditor;Accounting;Selection Process","segmantation;banks;franchising")
res <- Map(function(v) unlist(strsplit(v,";")),X)
such that
> res
$`Global Mindset;Management`
[1] "Global Mindset" "Management"
$`Auditor;Accounting;Selection Process`
[1] "Auditor" "Accounting" "Selection Process"
$`segmantation;banks;franchising`
[1] "segmantation" "banks" "franchising"
Here is the example data:
example_sentences <- data.frame(doc_id = c(1,2,3),
sentence_id = c(1,2,3),
sentence = c("problem not fixed","i like your service and would tell others","peope are nice however the product is rubbish"))
matching_df <- data.frame(x = c("not","and","however"))
Created on 2019-01-07 by the reprex package (v0.2.1)
I want to add/insert a comma just before certain words in a character string. For example, if my string is:
problem not fixed.
I want to convert this to
problem, not fixed.
The matching_df data frame contains the words to match (these are coordinating conjunctions), so if an x from matching_df is found in the sentence, a comma + space should be inserted before the detected word.
I have looked at the stringr package but am not sure how to achieve this.
I've no idea what the data frame you're talking about looks like, but I made a simple data frame containing some phrases here:
df <- data.frame(strings = c("problems not fixed.","Help how are you"),stringsAsFactors = FALSE)
I then made a vector of words to put a comma after:
words <- c("problems","no","whereas","however","but")
Then I put the data frame of phrases through a simple for loop, using gsub to replace each matched word with the word plus a comma:
for (i in 1:length(df$strings)) {
  string <- df$strings[i]
  findWords <- intersect(unlist(strsplit(string, " ")), words)
  if (length(findWords) > 0) {
    for (j in findWords) {
      # keep updating string so every matched word gets its comma
      string <- gsub(j, paste0(j, ","), string)
    }
    df$strings[i] <- string
  }
}
Output:
df
strings
1 problems, not fixed.
2 Help how are you
The gsubfn function in the gsubfn package takes a regular expression as the first argument and a list (or certain other objects) as the second argument where the names of the list are strings to be matched and the values in the list are the replacement strings.
library(gsubfn)
gsubfn("\\w+", as.list(setNames(paste0(matching_df$x, ","), matching_df$x)),
format(example_sentences$sentence))
giving:
[1] "problem not, fixed "
[2] "i like your service and, would tell others "
[3] "peope are nice however, the product is rubbish"
I've found variants on this issue but can't get the suggested solutions working in my situation. I'm pretty new to R with no other coding experience so it may be I'm just missing something basic. Thanks for any help!!
I have a data table with a column of names of organisations, call it Orgs$OrgName. Sometimes there are misspellings of words within the strings that make up the organisation names. I have a look-up table (imported from csv) with common misspellings in one column (spelling$misspelt) and their corrections in another column (spelling$correct).
I want to find any parts of OrgName strings which match spelling$misspelt and replace just those parts with spelling$correct.
I have tried various solutions based on mgsub, stri_replace_all_fixed, and str_replace_all (replacement of words in strings has been my main reference). But nothing has worked, and all the examples appear to be based on manually created vectors using vect1 <- c("item1", "item2", "item3") rather than on a lookup table.
Example of my data:
OrgName
1: WAIROA DISTRICT COUNCIL
2: MANUTAI MARAE COMMITTEE
3: C S AUTOTECH LTD
4: NEW ZEALAND INSTITUTE OF SPORT
5: BRAUHAUS FRINGS
6: CHRISTCHURCH YOUNG MENS CHRISTIAN ASSOCIATION
The lookup table:
mispelt correct
1 ABANDONNED ABANDONED
2 ABERATION ABERRATION
3 ABILITYES ABILITIES
4 ABILTIES ABILITIES
5 ABILTY ABILITY
6 ABONDON ABANDON
(There are no misspellings in the first few lines of org names, but there are 57,000+ more rows in the dataset.)
UPDATE: Here's what I have tried based on the update to the second response (trying that first as it's simpler). It hasn't worked, but hopefully someone can see where it's gone wrong?
library("stringi")
Orgs <- data.frame(OrgNameClean$OrgNameClean)
head(Orgs)
head(OrgNameClean)
write.csv(spelling$mispelt,file = "wrong.csv")
write.csv(spelling$correctspelling,file = "corrected.csv")
patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
head(patterns)
head(replacements)
for(i in 1:nrow(Orgs)) {
row <- Orgs[i,]
print(as.character(row))
#print(stri_replace_all_fixed(row, patterns, replacements,
#      vectorize_all=FALSE))
row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+%
"\\b", replacements, vectorize_all=FALSE)
print(row)
Orgs[i,] <- row
}
head(Orgs)
Orgsdt <- data.table(Orgs)
head(Orgsdt)
chckspellchk <- Orgsdt[OrgNameClean.OrgNameClean %like% "ENVIORNMENT",,]
##should return no rows if spelling correction worked
head(chckspellchk)
#OrgNameClean.OrgNameClean
#1: SMART ENVIORNMENTAL LTD
UPDATE 2: more information - there are spaces in the spelling lookup if that makes a difference:
> head(spelling[mispelt %like% " ",,])
mispelt correctspelling
1: COCA COLA COCA
2: TORTISE TORTOISE
> head(spelling[correctspelling %like% " "])
mispelt correctspelling
1: ABOUTA ABOUT A
2: ABOUTIT ABOUT IT
3: ABOUTTHE ABOUT THE
4: ALOT A LOT
5: ANYOTHER ANY OTHER
6: ASFAR AS FAR
We can use stringi's stri_replace_all_*() functions to do multiple replacements on a whole string.
library("stringi")
string <- "WAIROA ABANDONNED COUNCIL','C S AUTOTECH LTD', 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES"
patterns <- c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW', 'LTD')
replacements <- c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD', 'SGM')
stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
Output:
[1] "WAIROA ABANDONED COUNCIL','C S AUTOTECH SGM', 'OLD ZEALAND INSTITUTE OF ABERRATION ABILITIES"
Some notes:
stri_replace_all_fixed replaces occurrences of fixed pattern matches.
stri_replace_all_regex uses a regular expression pattern instead. This allows us to specify word boundaries: \b to avoid substring matches (an alternative to \bword\b is (?<=\W)word(?=\W)).
vectorize_all is set to FALSE, otherwise each replacement is applied to a new copy of the original sentence. See details here.
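A quick toy illustration of the vectorize_all point, with just two patterns:
library(stringi)
x <- "NEW ABANDONNED"
p <- c("ABANDONNED", "NEW")
r <- c("ABANDONED", "OLD")
# default vectorize_all = TRUE: each pattern works on its own copy of x,
# so you get one partially corrected string per pattern
stri_replace_all_fixed(x, p, r)
# [1] "NEW ABANDONED"  "OLD ABANDONNED"
# vectorize_all = FALSE: patterns are applied one after another to the
# same string, giving a single fully corrected result
stri_replace_all_fixed(x, p, r, vectorize_all = FALSE)
# [1] "OLD ABANDONED"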
Full sample:
library("stringi")
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL',
' SMART ENVIORNMENTAL LTD',
'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
stringsAsFactors = FALSE)
patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
for(i in 1:nrow(Orgs)) {
row <- Orgs[i,]
print(as.character(row))
row <- stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE)
#row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
print(row)
Orgs[i,] <- row
}
PS: I've made a separate CSV with a single headerless column for each character vector. But there are many other ways to read a CSV with R and convert the columns to a char vector.
PS2: If you want substring matches, e.g. matching ENVIORNMENT inside ENVIORNMENTAL, do not use stri_replace_all_regex() with word boundaries \b. This is a great tutorial to buff up your regex skills.
This answer is potentially too complicated for a new programmer, and I may be writing this more like Python than R (I'm getting a bit rusty on the latter),* but I have a proposed solution for your problem, which isn't trivial, by the way. The issue I foresee with the other solutions you looked at is that each one only addresses a small part of the larger puzzle, which is that you need to check every word inside every string against your lookup table. The simplest way I see to do this is to write a few small functions that do what you need and then use R's family of apply functions to loop through the entries with them.
The only other tricky thing here is using an R environment as your lookup table. For whatever reason in R people don't seem to talk much about or really use hash tables (the real name for a lookup table) but they are very common in other languages. Luckily R's environments are actually just an implementation of a C hash table, which is good because hashes are very fast and allow you to directly map one value to another. (More on this here, if interested.)
*I welcome comments or edits from others that would make my answer simpler or more R-idiomatic
# Some example data - note stringsAsFactors=FALSE is critical for this to work
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL',
'C S AUTOTECH LTD',
'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
stringsAsFactors = FALSE)
spelling_df <- data.frame("Mistake" = c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW'),
"Correct"= c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD'),
stringsAsFactors = FALSE)
# Function to convert your data frame to a hash table
create_hash <- function(in_df){
hash_table <- new.env(hash=TRUE)
for(i in seq(nrow(in_df)))
{
hash_table[[in_df[i, 1]]] <- in_df[i, 2]
}
return(hash_table)
}
# Make the hash table out of your data frame
spelling_hash <- create_hash(spelling_df)
# Try it out:
print(spelling_hash[['ABANDONNED']]) # prints ABANDONED
# Now make a function to apply the lookup - and ensure
# if the string is not in the lookup table, you return the
# original string instead (instead of NULL)
apply_hash <- function(in_string, hash_table=spelling_hash){
x = hash_table[[in_string]]
if(!is.null(x)){
return(x)
}
else{
return(in_string)
}
}
# Finally make a function to break the full company name apart,
# apply the lookup on each word, and then paste it back together
correct_spelling <- function(bad_string) {
split_string <- strsplit(as.character(bad_string), " ")
new_split <- lapply(split_string[[1]], apply_hash)
return(paste(new_split, collapse=' '))
}
# Make a new field that applies the spelling correction
Orgs$Corrected <- sapply(Orgs$OrgName, correct_spelling)
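For the toy data above, the expected result is (unname() just drops the names that sapply attaches):
unname(Orgs$Corrected)
# [1] "WAIROA ABANDONED COUNCIL"
# [2] "C S AUTOTECH LTD"
# [3] "OLD ZEALAND INSTITUTE OF ABERRATION ABILITIES"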
I came across a similar issue and might have a tidyverse-style solution.
stringr::str_replace_all should let us do multiple replacements using a named vector.
We could turn the lookup data frame of misspelled and corrected values into a named vector, and then use that named vector as a lookup in str_replace_all.
Here is an example using some of the misspelled and corrected values provided previously.
library(tidyverse)
# load data frame of misspelled and corrected values
foo <- read_csv("mispelt, correct
ABANDONNED, ABANDONED
ABERATION, ABERRATION
ABILITYES, ABILITIES
ABILTIES, ABILITIES
ABILTY, ABILITY
ABONDON, ABANDON
COCA COLA, COCA
TORTISE, TORTOISE
ABOUTA, ABOUT A
ABOUTIT, ABOUT IT
ABOUTTHE, ABOUT THE
ALOT, A LOT
ANYOTHER, ANY OTHER
ASFAR, AS FAR",
col_types = "c")
# str_replace_all requires a named vector of replacements
# the value of the vector is the correction,
# while the name of each value is the search string to replace
lookup <- foo$correct
names(lookup) <- foo$mispelt
# data frame to test our lookup named vector
tbl <- tibble(old = foo$mispelt)
# mutating to a new column to show replacement works,
# but we could just overwrite the old column as well using mutate
mutate(tbl, new = str_replace_all(old, lookup))
I did not deal with upper or lower case considerations as I'm just demonstrating the named vector usage in str_replace_all and the examples were all upper case. However, regular expressions and/or the regex function could probably help with that if necessary.
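For example, stringr's regex() modifier can make a pattern case-insensitive; a small standalone sketch (not tied to the lookup vector above):
library(stringr)
# ignore_case = TRUE lets "Abandonned" match the upper-case lookup pattern
str_replace_all("Abandonned plans",
                regex("ABANDONNED", ignore_case = TRUE),
                "ABANDONED")
# [1] "ABANDONED plans"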
Session info:
|package |loadedversion |
|:---------|:-------------|
|dplyr |1.0.7 |
|readr |2.0.0 |
|stringr |1.4.0 |
|tibble |3.1.3 |
|tidyverse |1.3.1 |
I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither mapply nor mgsub is working. I made these vectors:
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing" `
I know this answer is late on the scene but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e. when "needed") using the regex separator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
Try combining your subpatterns using |. For example
>s<-"#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants
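For instance, a minimal sketch of that idea, wrapping the loop in a function and applying it over the whole vector with vapply:
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
# apply each pattern in turn to a single string
clean_string <- function(s, patterns = remove) {
  for (p in patterns) s <- gsub(p, "", s)
  s
}
# apply to the whole vector of tweets
a.vector <- vapply(a.vector, clean_string, character(1), USE.NAMES = FALSE)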
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
I had a vector with the statement "my final score" and I wanted to keep only the word final and remove the rest. This is what worked for me, based on Marian's suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.
I have millions of Keywords in a column labeled Keyword.text. Each factor or Keyword can contain multiple words (or shall we say tokens). Here is an example with 4 keywords:
Keyword.text
The quick brown fox the
.8 .crazy lazy dog
dog
jumps over+the 9
I'd like to count the number of tokens in each Keyword, so as to obtain:
Keyword.length
5
4
1
4
I installed the tau package but I haven't gotten very far...
textcnt(Mydf$Keyword.text, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
returns an error I don't understand. Maybe it's due to having factors; it worked fine when practicing with a string.
I know how to do it in Excel, but it doesn't work for the last line. If A2 has the keywords, then =LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1 would do it.
Edit: For a data frame and the total number of keywords, just use strsplit. There's no need to use textcnt if you're not interested in the counts per keyword. That's where I got you wrong:
tt <- data.frame(
a=rnorm(3),
b=rnorm(3),
c=c("the quick fox lazy","rbrown+fr even","what what goes & around"),
stringsAsFactors=F
)
sapply(tt$c, function(n){
length(strsplit(n, split = "[[:space:][:punct:]]+")[[1]])
})
To read the data, take also a look at ?readLines and/or ?scan. This preserves the string format and allows you to process the file line by line (or row per row). If you use a file connection, you can even load the file in parts, which helps you when you hit memory limits.
A simple example using readLines :
con <- textConnection("
The lazy fog+fog fog
never ended for fog jumping over the
fog whatever . $ plus.
")
# For a real file you would use con <- file("myfile.txt")
Text <- readLines(con)
library(tau)  # textcnt() comes from the tau package
sapply(Text, textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
On a side note, using the option Dirk mentioned (stringsAsFactors=F) won't slow down performance compared to the usual read.table command. On the contrary, actually. You should use the sapply as mentioned above, but replace Text with as.character(Mydf$Keyword.text) (or use the stringsAsFactors=F option and drop the as.character()).
Please show the error.
Also try:
require(tau)
textcnt(as.character(Mydf$Keyword.text), split, ....)
... to force character mode.
Or load your data with stringsAsFactors=FALSE -- the same question has come up here before.
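A minimal sketch of that second suggestion (the file name here is hypothetical):
# read the data so Keyword.text stays character rather than factor
Mydf <- read.table("keywords.txt", header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE)
str(Mydf$Keyword.text)  # chr, so textcnt() gets a character vector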
What about a nice little function that lets us also decide which kinds of words we would like to count, and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo = FALSE){
  # pseudo = TRUE counts any run of non-whitespace; otherwise only alphabetic runs
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6
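Since str_count is vectorised, the same call also works on a whole column, e.g. assuming the Mydf$Keyword.text column from the question:
# count words for every keyword at once (coerce factors to character first)
Mydf$Keyword.length <- nwords(as.character(Mydf$Keyword.text), pseudo = TRUE)
Note that neither pattern exactly reproduces the tokenisation asked for above (with pseudo = TRUE, "over+the" counts as one token; with the default, digit-only tokens like "9" are dropped), so adjust the regex inside nwords if those cases matter.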