Find strings that start and end with certain characters [closed]

I'm working on a text-mining project with data from Twitter. In my data frame, many words have been converted into Unicode escapes, e.g.
<U+0E2B><U+0E25><U+0E07><U+0E1E>
I want to collect every converted word like the above and put them into one large string so I can deal with them separately.
Is there any way in R to find all the strings that start with <U+ and end with >?

Your request is a bit imprecise, so I'm taking the liberty of making a few assumptions about how you want the output.
text <- "Words <Q+0E2B><U+0E2B2>, 1 < 2, <p>
<U+0E2B><U+0E25><U+0E07><U+0E1E> </p> some more words"
# match a literal "<U+", then exactly four characters from 0-9A-Z (the hex code), then ">"
regmatches(text, gregexpr("<U\\+[0-9A-Z]{4}>", text))
# [[1]]
# [1] "<U+0E2B>" "<U+0E25>" "<U+0E07>" "<U+0E1E>"

Related

convert a docx text into dataset or matrix using R [closed]

I have a 150-page Microsoft Word document with information on books. I include an example right here:
The document contains several pieces of information for each book, and there are hundreds of different books listed in this docx file. I want to extract the information and convert it into a classic dataset with columns such as "Title (en)", "Title (de)", ..., "Abstract", and so on: one row per book and one column per attribute (English title, abstract, etc.). The catch is that not all books carry the same pieces of information, so when the Abstract (for example) is missing, I need that cell left empty.
How can I do that in R? I am not new to R, but I have never worked with text, so I am not sure what the best approach would be.
Thanks
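One possible starting point, assuming each field sits on its own paragraph in the form "Label: value" (e.g. "Title (en): ...") and every book record begins with a "Title (en)" line; the file name and labels below are placeholders, not taken from your document:
library(officer)
doc <- read_docx("books.docx")
paras <- docx_summary(doc)$text               # one element per paragraph
paras <- paras[nzchar(trimws(paras))]         # drop empty paragraphs
labels <- sub(":.*$", "", paras)              # text before the first colon
values <- trimws(sub("^[^:]*:", "", paras))   # text after the first colon
record <- cumsum(labels == "Title (en)")      # new record at each "Title (en)"
books <- data.frame(record, label = labels, value = values)
# one row per book, one column per label; fields a book lacks become NA
wide <- reshape(books, idvar = "record", timevar = "label", direction = "wide")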

Extract date from URL link / random string [closed]

I would like to extract dates from a column of URL links (5,000 rows of raw data).
Samples of the URLs include:
http://en/Pages/Introduction-More_Details-20191103.com
http://en/Pages/United-Kingdom-Page1-EU-20190502.com
http://en/Pages/France-2019-Description-20190612.com
http://en/Pages/telephone-in-the-UK-and-USA-190405.com
Is there any R code that can learn the pattern and extract the date into another column?
Thank you.
The different lengths of the URLs can be a problem...
At least from your sample, the dates always follow a - and are runs of six to eight digits. You could catch them with a regex:
urls <- c('http://en/Pages/Introduction-More_Details-20191103.com',
'http://en/Pages/United-Kingdom-EU-20190502.com',
'http://en/Pages/France-20190612.com',
'http://en/Pages/telephone-in-the-UK-and-USA-190405.com')
# the greedy (.*) walks to the last hyphen that precedes a 6-8 digit run,
# and the replacement keeps only that run (\2)
gsub('(.*)-(\\d{6,8})(.*)', '\\2', urls)
#[1] "20191103" "20190502" "20190612" "190405"
Or, anchoring on the .com suffix:
gsub('(.*)-(\\d{6,8})(\\.com)', '\\2', urls)
Then you can save the result to a new column. Obviously, how easy it is to pick up all the URLs depends on how many different formats you have.
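To turn the extracted strings into actual Date values, a follow-up sketch (assuming 8-digit strings are yyyymmdd and 6-digit strings are yymmdd dates in the 2000s):
raw <- gsub('(.*)-(\\d{6,8})(.*)', '\\2', urls)
# normalise 6-digit yymmdd to 8-digit yyyymmdd before parsing
raw8 <- ifelse(nchar(raw) == 6, paste0("20", raw), raw)
as.Date(raw8, format = "%Y%m%d")
#[1] "2019-11-03" "2019-05-02" "2019-06-12" "2019-04-05"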

Remove recurring text strings [closed]

I am new to R and have searched the forum for almost two hours now without getting this to work.
My problem: I have a long text string scraped from the internet. The scrape picked up the embed code for images, which always starts with "Embed from Getty Images" and ends with "false })});\n". I would like to remove everything between those markers. I have tried gsub() as per:
AmericanTexts3 <- gsub("Embed.*})});\n", "", AmericanTexts)
But what happens then is that it removes everything between the first picture and the last picture. Does anyone know how to solve this?
You need a non-greedy regular expression, and the ) characters in the pattern must be escaped or R will reject the regex.
Try
# non-greedy .*? stops at the first closing marker; } and ) are escaped
AmericanTexts3 <- gsub("Embed.*?\\}\\)\\}\\);\n", "", AmericanTexts)
The ? makes .* non-greedy, so it matches up to the first occurrence of the closing marker and only the part between each pair of markers is removed.
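A tiny illustration with made-up text (the IMG placeholders are invented) showing the difference:
x <- "a Embed from Getty Images IMG1 false })});\n b Embed from Getty Images IMG2 false })});\n c"
gsub("Embed.*\\}\\)\\}\\);\n", "", x)   # greedy: wipes from IMG1 through IMG2
# [1] "a  c"
gsub("Embed.*?\\}\\)\\}\\);\n", "", x)  # non-greedy: removes each embed block
# [1] "a  b  c"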

Splitting a character into separate words in R [closed]

I am working on a project in R (on the TED_Talks data set). I have a data frame with one column called "tags" which contains a character string like
"gaming,gender,sex,feminism,education,culture".
The problem is that the whole row is being read as a single character string.
I want the output to be a vector containing the separate words, e.g.
"gaming","gender","sex","feminism","education","culture"
so I can do further analysis on the tags.
You can simply do the following:
Say your entry is in object a, and you want to assign the final result to object b:
a <- "gaming,gender,sex,feminism,education,culture"
# strsplit returns a list (one vector per input string); unlist flattens it
b <- unlist(strsplit(a, ",", fixed = TRUE))
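Since the tags live in a data frame column, the same idea applies row by row (df is a placeholder name):
# one character vector of tags per row, stored as a list column
df$tag_list <- strsplit(df$tags, ",", fixed = TRUE)
# or one flat vector of all tags across all rows
all_tags <- unlist(strsplit(df$tags, ",", fixed = TRUE))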

Find duplicate records in R [closed]

I have an Excel file with a list of emails and the channels that collected them. How can I count, using R, how many emails per channel are duplicated, and automate it (every time I import a different file I just have to run it and get the results)?
Thank you!!
Assuming the "df" dataframe has the relevant variables under the names "channel" and "email", then:
To get the number of unique channel-email pairs:
dim(unique(df[c("channel", "email")]))[1]
To get the sum of all channel-email observations:
sum(table(df$channel, df$email))
To get the number of duplicates, simply subtract the former from the later:
sum(table(df$channel, df$email)) - dim(unique(df[c("channel", "email")]))[1]
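To see how many duplicated emails each channel contributes, and to automate the import for each new file, a sketch (the file name is a placeholder and the readxl package is assumed):
library(readxl)
df <- read_excel("emails.xlsx")               # re-run this line for each new file
dup <- duplicated(df[c("channel", "email")])  # TRUE for repeated channel-email pairs
table(df$channel[dup])                        # duplicated emails per channel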
