extract string using same pattern in a text using R

extract string using same pattern in a text using R - r

I'm trying to deal with text with R and here is my question.
From this source text
#Pray4Manchester# I hope that #ArianaGrande# will be better soon.
I want to extract Pray4Manchester and ArianaGrande using the pattern #.+#, but when I run
str_extract_all(text,pattern="#.+#")
I get
#Pray4Manchester# I hope that #ArianaGrande#
How to solve this? Thanks.

We can do
str_extract_all(text, "(?<=#)\\w*(?=#)")[[1]]
#[1] "Pray4Manchester" "ArianaGrande"
data
text <- "#Pray4Manchester# I hope that #ArianaGrande# will be better soon."

You could use regex to look for results that match text between two hashes that don't contain a space character.
Something like this: ([#]{1}[^\s]+[#]{1})

Related

Removing part of strings within a column

I have a column within a data frame with a series of identifiers in, a letter and 8 numbers, i.e. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know if there was just one thing I wanted to remove, like the B15, I could do;
sub(“B15”, ””, df$col)
But I’m not sure on the how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)

Welcome to SO! This is a case of regex. You can use base R as I show here or look into the stringR package for handy tools that are easier to understand. You can also look for regex rules to help define what you want to look for. For what you ask you can use the following code example to help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function to replace a pattern with something else in one or a series of inputs. To test our regex I created the testStrings vector for different examples.
Breaking down the regex code, "B15" is the pattern you're specifically looking for. The "." means any character and the "{2}" is saying what range of any character we want to grab after "B15". You can change it as you need. If you want to remove everything after "B15". replace the pattern with "B15.". the "" means everything till the end.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern as so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has a info on different regex's you can make to be more particular.

R - Tokenization - single and two letter words in a TermDocumentMatrix

I am currently trying to do a little bit of text processing and I would like to get the one and two letter words in a TermDocumentMatrix.
The issue is that it seems to display only 3 letter words and more.
library(tm)
library(RWeka)
test<-'This is a test.'
testmyCorpus<-Corpus(VectorSource(test))
testTDF<-TermDocumentMatrix(testmyCorpus, control=list(tokenize=AlphabeticTokenizer))
inspect(testTDF)
Only the words "this" and "test" are displayed. Any ideas?
Thanks a lot for you help!
Robert

Here is the answer to almost your problem: in short, you should add an option control=list(wordLengths=c(1,Inf) to TermDocumentMatrix.

Improperly formatted CSV, how to repair?

I have a csv, and each line reads as follows:
"http://www.videourl.com/video,video title,video duration,thumbnail,<iframe src=""http://embed.videourl.com/video"" frameborder=0 width=510 height=400 scrolling=no> </iframe>,tag 1,tag 2",,,,,,,,,,,,,,,,,,,,,,,,,,
Is there a program I can use to clean this up? I'm trying to import it to wordpress and map it to current fields, but it isn't functioning properly. Any suggestions?

Just use search and replace in this case. remove the commas at the end and then replace the remaining commas with ",".

Should anyone else have the same issue. Know that this solution will only work with data much like the example giving. If data has a lot of text and there are commas within the text that need kept. Then search replacing comma will not work. Using regex would be the next option and that can be done in Notepad ++
However I think the regex pattern depends on the data so not much point creating an example.
PHP could be used to explode each line also. Remove values that match a regex out of many i.e. URL, money. Then what is left could be (depending on the data again) just a block of text. That approach may not work if there are two or more columns with a lot of text

parsing xml file manually with r

for some reason, I cannot download the r xml package at work. I have an xml file that has contents like this:
x<-read.table("info.xml")
x
</name></content></item><item id="id-123"><content><name>
</name></content></item><item id="id-456"><content><name>
</name></content></item><item id="id-5559"><content><name>
I need to pick values that start with id and - and the numbers like
id-123, id-456 id-5559, etc
tried this:
str_extract_all(x, "id-[0-9]")
but is only printing id-1, I really need help very quick. Any ideas?

str_extract_all(x, "id-[0-9]+")

The regular expression "id-[0-9]" is missing a "+" at the end.
There may be more issues, but that one jumps out.

This regex is not right

I am trying to use regex generators to create an expression, but I can't seem to get it right.
What I need to do is find the following type of string in a string:
community_n
For example, within the string which may be
community community_1 community_new_1 community_1_new
from that, I just want to extract community_1
I have tried /(community_\\d+)/, but that is clearly not right.

Try adding word boundries, so
/(\\bcommunity_\\d+\\b)/

Try using the regex (community_\d+).
Though I could be incorrect since I don't know which language you are using.
(For some reason I cannot add comments, I can only answer questions).