I'm trying to run text analysis on a list of 2000+ rows of keywords, but they are listed like
"Strategy;Management Styles;Organizations"
So when I use tm to remove punctuation it becomes
"StrategyManagement StylesOrganizations"
and I assume this breaks my frequent-terms analysis somehow.
I've tried using
vector<-gsub(';', " ",vector)
but this turns my vector data ("List of 2000") into a value described as "Large character (3 elements)". When I inspected this value, it gave me a really long list of words that took forever to load. Any ideas what I'm doing wrong?
Should I use gsub on my vector or on my corpus? They are just
vector<-VectorSource(dataset$Keywords)
corpus<-VCorpus(vector)
I tried using
inspect(corpus[[1]])
on my corpus after gsub had turned it into a value, but I got the error "no applicable method for 'inspect' applied to an object of class "character""
You need to split the data into a vector of strings. One way to do this is with the stringr package, as follows:
library(tm)
library(stringr)
vector <- c("Strategy;Management Styles;Organizations")
keywords <- unlist(stringr::str_split(vector, ";"))
vector <- VectorSource(keywords)
corpus <- VCorpus(vector)
inspect(corpus[[1]])
#<<PlainTextDocument>>
# Metadata: 7
#Content: chars: 8
#Strategy
Maybe you can try strsplit
X <- c("Global Mindset;Management","Auditor;Accounting;Selection Process","segmantation;banks;franchising")
res <- Map(function(v) unlist(strsplit(v,";")),X)
such that
> res
$`Global Mindset;Management`
[1] "Global Mindset" "Management"
$`Auditor;Accounting;Selection Process`
[1] "Auditor" "Accounting" "Selection Process"
$`segmantation;banks;franchising`
[1] "segmantation" "banks" "franchising"
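To tie this back to the original frequency question, here is a minimal base R sketch, assuming the keywords are still a plain character vector (that is, the split happens before `VectorSource()` is called, not on the source or corpus object). The sample strings and the name `keywords_raw` are made up for illustration.

```r
# Hypothetical sample data standing in for dataset$Keywords
keywords_raw <- c("Strategy;Management Styles;Organizations",
                  "Strategy;Organizations")

# Split each row on ";" and flatten into one vector of keywords
keywords <- trimws(unlist(strsplit(keywords_raw, ";", fixed = TRUE)))

# Count how often each keyword occurs, most frequent first
freq <- sort(table(keywords), decreasing = TRUE)
print(freq)
```

Multi-word keywords such as "Management Styles" stay intact this way, which may or may not be what you want once the text goes through tm's tokenizer.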
I have been trying to remove any words in dfmedia (size 29175) that match any word in dfvocab (size 6001).
dfmedia: each row is a sentence of words in Chinese.
我喜歡吃蘋果; 我愛吃饅頭; 我不喜歡菠菜; 我最討厭蘋果!;我很愛菠菜啊;哪個中國人敢不喜歡饅頭?;哎呀饅頭蘋果菠菜都是食物管人家喜歡否?
dfvocab: 蘋果,饅頭,菠菜
desired result: 我喜歡吃; 我愛吃; 我不喜歡; 我最討厭!;我很愛啊;哪個中國人敢不喜歡?;哎呀都是食物管人家喜歡否?
I don't think the results will be any different in Chinese or English, since it is a simple match and remove/replace, but I'm including the Chinese here just in case, since my original data is Chinese.
I have tried gsub(), mapply(), and using stringr to bind dfmedia and dfvocab together into one data frame before removing. However, since dfvocab and dfmedia are different sizes, I am unsure how to approach this with the methods suggested online.
Any help would be really appreciated!
It's pretty straightforward with gsub. Just paste0 all the vocab together with the regex OR operator and replace with ""
> gsub(paste0(dfvocab, collapse="|"), "", dfmedia)
[1] "我喜歡吃" " 我愛吃" " 我不喜歡" " 我最討厭!" "我很愛啊" "哪個中國人敢不喜歡?"
[7] "哎呀都是食物管人家喜歡否"
(I do not speak or read Chinese.) I imagine that with such a large vocab set to delete, you might need to break the 6000 vocab words into chunks, and I suspect it will be slow. You might want to look at the tm package, since text mining might be a task that requires such operations to be optimized.
Here's a way to build a reproducible example:
> dfmedia <- scan(text="我喜歡吃蘋果; 我愛吃饅頭; 我不喜歡菠菜; 我最討厭蘋果!;我很愛菠菜啊;哪個中國人敢不喜歡饅頭?;哎呀饅頭蘋果菠菜都是食物管人家喜歡否", what="", sep=";")
Read 7 items
>
> dfvocab <- scan(text="蘋果,饅頭,菠菜", what="", sep=",")
Read 3 items
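The chunking idea mentioned above could look roughly like this. It is only a sketch: `remove_vocab` and `chunk_size` are hypothetical names, and it assumes the vocab words contain no regex metacharacters (otherwise they would need escaping first).

```r
dfmedia <- c("我喜歡吃蘋果", "我愛吃饅頭", "我不喜歡菠菜")
dfvocab <- c("蘋果", "饅頭", "菠菜")

# Hypothetical helper: remove vocab words in chunks so that no single
# alternation pattern grows too large
remove_vocab <- function(text, vocab, chunk_size = 500) {
  chunks <- split(vocab, ceiling(seq_along(vocab) / chunk_size))
  for (chunk in chunks) {
    text <- gsub(paste0(chunk, collapse = "|"), "", text)
  }
  text
}

cleaned <- remove_vocab(dfmedia, dfvocab, chunk_size = 2)
print(cleaned)
```

Because gsub is vectorized over the text, the different sizes of dfmedia and dfvocab do not matter; only the number of chunks determines how many passes are made.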
I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
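One thing to watch with sub: when the pattern does not match, the input string is returned unchanged. A small sketch of guarding against that, assuming you would rather get NA for elements without a sample_id (the fourth element is made up to show the case):

```r
x <- c("sample_id=10221108;gender=male",
       "tissue_id=23;sample_id=321108;gender=male",
       "treatment=no;tissue_id=98;sample_id=22",
       "gender=female")  # hypothetical element with no sample_id

pattern <- "^.*\\bsample_id=([^;]+).*$"

# Extract the captured group where the pattern matches, NA elsewhere
out <- ifelse(grepl(pattern, x, perl = TRUE),
              sub(pattern, "\\1", x, perl = TRUE),
              NA_character_)
print(out)
```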
You could try the str_extract function from the stringr package.
If your data has one record per element, you can do:
str_extract(x, "(?<=\\bsample_id=)[:digit:]+") # targets a run of digits preceded by sample_id=; the + captures all of the digits
This extracts just the first number per element. If several records are collected into a single string, it becomes a tad more difficult, because you have to tell the extraction to continue after it has found a match. The code would look something like this:
str_extract_all(x, "(?<=sample_id=)\\d+")
This extracts all of the numbers you're looking for, and the output is a list. From there you can manipulate the list as you see fit.
Here is the example data:
example_sentences <- data.frame(doc_id = c(1,2,3),
sentence_id = c(1,2,3),
sentence = c("problem not fixed","i like your service and would tell others","peope are nice however the product is rubbish"))
matching_df <- data.frame(x = c("not","and","however"))
Created on 2019-01-07 by the reprex package (v0.2.1)
I want to insert a comma just before a certain word in a character string. For example, if my string is:
problem not fixed.
I want to convert this to
problem, not fixed.
The other data frame, matching_df, contains the words to match (these are coordinating conjunctions). If a word from matching_df$x is found in the sentence, a comma and a space should be inserted before the detected word.
I have looked at the stringr package but am not sure how to achieve this.
I've no idea what the data frame you're talking about looks like, but I made a simple data frame containing some phrases here:
df <- data.frame(strings = c("problems not fixed.","Help how are you"),stringsAsFactors = FALSE)
I then made a vector of words to put a comma after:
words <- c("problems","no","whereas","however","but")
Then I put the data frame of phrases through a simple for loop, using gsub to substitute the word for a word + comma:
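for (i in seq_along(df$strings)) {
  string <- df$strings[i]
  # words from the list that occur in this phrase
  findWords <- intersect(unlist(strsplit(string, " ")), words)
  for (j in findWords) {
    # update `string` cumulatively so earlier substitutions are kept
    string <- gsub(j, paste0(j, ","), string, fixed = TRUE)
  }
  df$strings[i] <- string
}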
Output:
df
strings
1 problems, not fixed.
2 Help how are you
The gsubfn function in the gsubfn package takes a regular expression as the first argument and a list (or certain other objects) as the second argument where the names of the list are strings to be matched and the values in the list are the replacement strings.
library(gsubfn)
gsubfn("\\w+", as.list(setNames(paste0(matching_df$x, ","), matching_df$x)),
format(example_sentences$sentence))
giving:
[1] "problem not, fixed "
[2] "i like your service and, would tell others "
[3] "peope are nice however, the product is rubbish"
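Both answers above put the comma after the matched word, while the question asks for it before the word. A base R sketch of the "before" variant, assuming the words are separated from the preceding text by whitespace (the names `sentences` and `conj` are just for illustration):

```r
sentences <- c("problem not fixed",
               "i like your service and would tell others",
               "peope are nice however the product is rubbish")
conj <- c("not", "and", "however")

# Match the whitespace in front of any whole-word conjunction and
# replace it with ", " followed by the captured word
pattern <- paste0("\\s+\\b(", paste(conj, collapse = "|"), ")\\b")
result <- gsub(pattern, ", \\1", sentences, perl = TRUE)
print(result)
```

A conjunction at the very start of a sentence has no preceding whitespace, so it is left alone.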
This could be a very basic question, but honestly, I tried a few solutions from similar questions and could not get any of them to work on my data. It could be because of my data, or I am having a hard day and couldn't figure anything out. :(
I have a vector of sentences
vec = c("having many items", "have an apple", "item")
Also, I have a data frame to lemmatize the data
lem = data.frame(pattern = c("(items)|(item)", "(has)|(have)|(having)|(had)"), replacement = c("item", "have"))
lem$pattern = as.character(lem$pattern)
lem$replacement = as.character(lem$replacement)
I want to go through each row in the lem data frame to form a replacement command.
Option 1:
library(stringr) #this is said to be quicker than gsub and my data has 3 mil sentences
vec <- sapply(lem, function(x) str_replace_all(vec, pattern=x$pattern, replacement = x$replacement))
Error in x$pattern : $ operator is invalid for atomic vectors
Option 2:
library(doPar)
vec <- foreach(i = 1:nrow(lem)) %dopar% {
str_replace_all(vec, pattern = lem[i, "pattern"], replacement = lem[i, "replacement"])
}
Option 2 returns a list of 2 vectors: the first one is what I want, but the second one is the original, and I don't know why. Also, when I tested it on my machine, doPar (despite running in parallel) was not as fast as sapply.
Since my data is quite big (3 mil sentences), could somebody recommend an efficient method to lemmatize the text data?
Another option is to create a named vector from your pattern and replacement vectors instead of a data frame, and then use str_replace_all directly, like this:
library(stringr)
vec <- c("having many items", "has an apple", "items")
lem <- c("item", "have")
names(lem) <- c("(items)|(item)", "(has)|(have)|(having)|(had)")
str_replace_all(vec, lem)
## "have many item" "have an apple" "item"
You could use the stri_replace_all_regex from the stringi library, which will perform your replacements sequentially:
library(stringi)
stri_replace_all_regex(vec, lem$pattern, lem$replacement, vectorize_all = FALSE)
[1] "have many item" "have an apple" "item"
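One caveat with these patterns: without word boundaries, "has" would also match inside a word like "hash". A base R sketch that wraps each pattern in `\\b`, assuming whole-word replacement is what you want (the "a hash table" element is made up to show the difference, and the capture-group parentheses from the question are dropped for brevity):

```r
vec <- c("having many items", "have an apple", "item", "a hash table")
lem <- data.frame(pattern = c("items|item", "has|have|having|had"),
                  replacement = c("item", "have"),
                  stringsAsFactors = FALSE)

# Apply each row of lem in turn, matching whole words only
for (i in seq_len(nrow(lem))) {
  bounded <- paste0("\\b(?:", lem$pattern[i], ")\\b")
  vec <- gsub(bounded, lem$replacement[i], vec, perl = TRUE)
}
print(vec)
```

I have not benchmarked this against stringr or stringi on 3 million sentences, so treat it as an illustration of the boundary handling rather than a speed recommendation.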
I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]]", "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither the mapply nor the mgsub are working. I made these vectors
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"
I know this answer is late on the scene, but it stems from my dislike of having to manually list the removal patterns inside the gsub calls (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them together (i.e. when "needed") using the regex separator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
Try combining your subpatterns using |. For example
> s <- "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation "
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation "
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants.
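As a rough sketch of that function: since gsub is already vectorized over its input, the loop over patterns can be applied to the whole vector at once without an apply call. The name `remove_patterns` is hypothetical, and the whitespace cleanup at the end is an extra step I am assuming you want.

```r
# Hypothetical helper: apply each removal pattern in turn, then tidy whitespace
remove_patterns <- function(x, patterns) {
  for (p in patterns) {
    x <- gsub(p, "", x)
  }
  # collapse runs of whitespace left behind and trim the ends
  trimws(gsub("[[:space:]]+", " ", x))
}

a.vector <- c("#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
              "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg")
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")

cleaned <- remove_patterns(a.vector, remove)
print(cleaned)
```

Note that "#\\w+" removes every hashtag, including trailing ones such as #mental.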
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
I had a vector with the statement "my final score" and I wanted to keep only the word "final" and remove the rest. This is what worked for me, based on Marian's suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.