Separate a row into many rows using a separator [duplicate] - r

This question already has an answer here:
Split Strings into values in long dataframe format [duplicate]
(1 answer)
Closed 3 years ago.
Starting from a data frame with rows like this:
data.frame(text = c("in this line ???? another line and ???? one more", "more lines ???? another row"))
how can I separate each row into multiple rows, using ???? as the separator? Here is the expected output:
data.frame(text = c("in this line", "another line and", "one more", "more lines", "another row"))

Here is a base R solution:
dfout <- data.frame(text = unlist(strsplit(as.character(df$text),split = " \\?{4} ")))
or a more efficient version (thanks to comments by @Sotos):
dfout <- data.frame(text = unlist(strsplit(as.character(df$text),split = " ???? ", fixed = TRUE)))
such that
> dfout
              text
1     in this line
2 another line and
3         one more
4       more lines
5      another row
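
A tidyverse alternative, assuming the tidyr package is installed, is separate_rows(), which splits a column on a regular expression and returns one row per piece (the same " \\?{4} " pattern works here):
library(tidyr)

df <- data.frame(text = c("in this line ???? another line and ???? one more",
                          "more lines ???? another row"),
                 stringsAsFactors = FALSE)
separate_rows(df, text, sep = " \\?{4} ")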

Related

Count the amount of bigrams in sentences in a dataframe

I have a dataset that looks a bit like this:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
I would like to have a count where I can see the occurrence of certain bigrams. So let's say I have:
trigger_bg_1 <- "sample text"
I expect the output to be 2 (as there are two occurrences of "sample text" in the two sentences). I know how to do a word count for a single word like this:
trigger_word <- "sample"            # single word to count (example value)
trigger_word_count <- 0
for (i in 1:nrow(df)) {
  # split the i-th sentence into words
  words <- unlist(strsplit(as.character(df$sentences[i]), " "))
  for (w in words) {
    if (w == trigger_word) {
      trigger_word_count <- trigger_word_count + 1
    }
  }
}
But I can't get something working for a bigram. Any thoughts on how I should change the code to get it working? I also have a long list of trigger words that I need to count over many sentences.
In case you want to count the sentences where you have a match, you can use grep:
length(grep(trigger_bg_1, sentences, fixed = TRUE))
#[1] 2
In case you want to count how many times you find trigger_bg_1 you can use gregexpr:
sum(unlist(lapply(gregexpr(trigger_bg_1, sentences, fixed = TRUE),
                  function(x) sum(x > 0))))
#[1] 2
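Since you mention a long list of trigger bigrams, the same gregexpr idea extends to a whole vector of triggers; a small sketch, where the triggers vector is just an assumed example:
triggers <- c("sample text", "in sentence")   # hypothetical list of trigger bigrams
sapply(triggers, function(tr)
  sum(unlist(lapply(gregexpr(tr, sentences, fixed = TRUE),
                    function(x) sum(x > 0)))))
# sample text in sentence 
#           2           2 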
You could sum a grepl:
sum(grepl(trigger_bg_1, df$sentences))
#[1] 2
If you are really interested in bigrams rather than just set word combinations, the package quanteda can offer a more substantial and systematic way forward:
Data:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
Solution:
library(quanteda)
# strip sentences down to words (removing punctuation):
words <- tokens(sentences, remove_punct = TRUE)
# make bigrams, tabulate them and sort them in decreasing order:
bigrams <- sort(table(unlist(as.character(tokens_ngrams(words, n = 2, concatenator = " ")))), decreasing = T)
Result:
bigrams
in sentence sample text text in sentence 1 sentence 2 
          2           2       2          1          1 
If you want to inspect the frequency count of a specific bigram:
bigrams["in sentence"]
in sentence 
          2 

Keep rows which have a specific word [duplicate]

This question already has answers here:
Filter rows which contain a certain string
(5 answers)
Closed 3 years ago.
Using this command, I keep the rows which exactly match the specific word:
df[df$ID == "interesting", ]
If the word exists in the row but has other words around it, how is it possible to check whether the word exists and keep the row?
Example input
data.frame(text = c("interesting", " I am interesting for this", "remove")
Expected output
data.frame(text = c("interesting", " I am interesting for this")
1. Example data:
df <- data.frame(text = c("interesting", " I am interesting for this", "remove"),
stringsAsFactors = FALSE)
2. Solution using base R, indexing with grepl:
df[grepl("interesting", df$text), ]
This returns:
[1] "interesting" " I am interesting for this"
Edit 1
Changed the code so that it returns a data.frame and not a vector:
df[grep("interesting", df$text), , drop = FALSE]
This now returns:
text
1 interesting
2 I am interesting for this
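A tidyverse alternative, assuming the dplyr and stringr packages are available, filters on a string match in the same way and keeps the first two rows:
library(dplyr)
library(stringr)

df %>% filter(str_detect(text, "interesting"))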

Classification based on list of words R

I have a data set with article titles and abstracts that I want to classify based on matching words.
"This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text"
Topic 1   Topic 2   Topic (X)
word1     word4     word(a)
word2     word5     word(b)
word3     word6     word(c)
Given that the text above matches words in Topic 2, I want to assign a new column with this label. Preferably this would be done with "tidyverse" packages.
Given the sentence as a string and the topics in a data frame, you can do something like this:
input<- c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text")
df <- data.frame(Topic1 = c("word1", "word2", "word3"),Topic2 = c("word4", "word5", "word6"))
## This splits on space and punctuation (only , and .)
input <- unlist(strsplit(input, " |,|\\."))
newcol <- paste(names(df)[apply(df, 2, function(x) sum(input %in% x) > 0)], collapse = ", ")
Since I am unsure of the data frame you want to add this to, I have made a vector newcol.
If you have a data frame of long sentences, you can use a similar approach:
inputdf <- data.frame(title = c("This is an example of text that I want to classify based on the words that are matched from a list. This would be about 2 - 3 sentences long. word4, word5, text, text, text", "word2", "word3, word4"))
input <- strsplit(as.character(inputdf$title), " |,|\\.")
inputdf$newcolmn <- unlist(lapply(input, function(x) paste(names(df)[apply(df, 2, function(y) sum(x %in% y) > 0)], collapse = ", ")))
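Since the question prefers tidyverse packages, here is a hedged sketch along the same lines, assuming dplyr is available; the topics_long lookup is my own reshaping of the topic table above:
library(dplyr)

topics_long <- data.frame(
  word  = c("word1", "word2", "word3", "word4", "word5", "word6"),
  topic = rep(c("Topic1", "Topic2"), each = 3),
  stringsAsFactors = FALSE
)

inputdf %>%
  mutate(newcolmn = vapply(
    strsplit(as.character(title), " |,|\\."),
    function(w) paste(unique(topics_long$topic[topics_long$word %in% w]),
                      collapse = ", "),
    character(1)
  ))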

Collapse character vector into single observation in R

How do you reduce a multi-valued vector to a single observation? Specifically, dealing with text. The solution should be scalable.
Consider:
col <- c("This is row 1", "AND THIS IS ROW 2", "Wow, and this is row 3!")
Which returns the following:
> col
[1] "This is row 1" "AND THIS IS ROW 2" "Wow, and this is row 3!"
Where the desired solution looks like this:
> col
[1] "This is row 1 AND THIS IS ROW 2 Wow, and this is row 3!"
You are looking for ?paste:
> paste(col, collapse = " ")
#[1] "This is row 1 AND THIS IS ROW 2 Wow, and this is row 3!"
In this case you want to collapse the strings together and add a space in between them. You can also check out paste0.
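If scalable means collapsing text within many groups of a data frame rather than a single vector, paste() combines naturally with aggregate(); a minimal sketch with assumed column names:
dat <- data.frame(id   = c(1, 1, 2),
                  text = c("This is row 1", "AND THIS IS ROW 2", "Wow, and this is row 3!"),
                  stringsAsFactors = FALSE)
# paste the text together within each id; returns one collapsed string per id
aggregate(text ~ id, data = dat, FUN = paste, collapse = " ")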

R extract numbers while leaving blanks [duplicate]

This question already has answers here:
Extracting numbers from vectors of strings
(12 answers)
Closed 8 years ago.
Consider the input c("foo 1", "bar 2", "baz"). I'd like to turn this into c(1,2,NA) (basically extract the numbers from each string, or if none exist turn it into an NA). My first pass looks like this:
funNums = as.numeric(
  regmatches(x$Fun,
             regexpr('\\d+', x$Fun, perl = T)))
where x$Fun is my input vector. The output I get from this though is c(1,2) since regmatches throws away things which don't match. How can I get it to include NAs?
X <- c("foo 1", "bar 2", "baz")
as.numeric(gsub("([^[:digit:]]*)", "", X))
# [1] 1 2 NA
(Do be aware that when passed a string like "1 to 2", this will return the number 12, which may not be what you'd like it to do.)
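If you prefer to keep the regmatches() approach from the question, one sketch is to pre-fill the result with NA and assign only the positions where regexpr() found a match (reusing X from above):
m <- regexpr("\\d+", X)                 # -1 where no number is found
funNums <- rep(NA_real_, length(X))
funNums[m > 0] <- as.numeric(regmatches(X, m))
funNums
# [1]  1  2 NA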
