How do you reduce a multi-valued vector to a single observation, specifically when dealing with text? The solution should be scalable.
Consider:
col <- c("This is row 1", "AND THIS IS ROW 2", "Wow, and this is row 3!")
Which returns the following:
> col
[1] "This is row 1" "AND THIS IS ROW 2" "Wow, and this is row 3!"
Where the desired solution looks like this:
> col
[1] "This is row 1 AND THIS IS ROW 2 Wow, and this is row 3!"
You are looking for ?paste:
> paste(col, collapse = " ")
#[1] "This is row 1 AND THIS IS ROW 2 Wow, and this is row 3!"
In this case you want to collapse the strings together and add a space in between them. You can also check out paste0.
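As a minimal sketch of how the two arguments differ (the variable names one_string, pairwise and glued are illustrative only): collapse joins the elements of a single vector, while sep joins corresponding elements of several vectors.

```r
col <- c("This is row 1", "AND THIS IS ROW 2", "Wow, and this is row 3!")

# collapse: join all elements of one vector into a single string
one_string <- paste(col, collapse = " ")

# sep: join corresponding elements of several vectors; the length stays 3
pairwise <- paste("Row:", col, sep = " ")

# paste0 is paste with sep = "", so the elements are glued with no space
glued <- paste0(col, collapse = "")

print(one_string)
print(length(pairwise))
```

Because paste is vectorised, the collapse form should scale to long vectors without an explicit loop.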
I have a dataset that looks a bit like this:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
I would like to count the occurrences of certain bigrams. So let's say I have:
trigger_bg_1 <- "sample text"
I expect an output of 2 (as there are two occurrences of "sample text" in the two sentences). I know how to do a word count like this:
# hypothetical single trigger word, kept separate from the counter
trigger_word <- "sample"
trigger_word_count <- 0
for (i in seq_len(nrow(df))) {
  # as.character() guards against factor columns in older R versions
  words <- strsplit(as.character(df$sentences[i]), " ")[[1]]
  for (word in words) {
    if (word == trigger_word) {
      trigger_word_count <- trigger_word_count + 1
    }
  }
}
But I can't get this working for a bigram. Any thoughts on how I should change the code to get it working?
But as I have a long test of trigger-words which I need to count in over
In case you want to count the sentences where you have a match you can use grep:
length(grep(trigger_bg_1, sentences, fixed = TRUE))
#[1] 2
In case you want to count how many times you find trigger_bg_1 you can use gregexpr:
sum(unlist(lapply(gregexpr(trigger_bg_1, sentences, fixed = TRUE)
, function(x) sum(x>0))))
#[1] 2
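The two counts only differ when a trigger appears more than once in a sentence. A minimal sketch of that difference, extended to several triggers at once; the helper count_occurrences, the vector triggers and the data in sentences2 are hypothetical and not part of the original question:

```r
# Hypothetical data: the first sentence contains the trigger twice
sentences2 <- c("sample text and more sample text", "no match here", "sample text")

# Count every occurrence of a fixed string across a character vector
count_occurrences <- function(pattern, x) {
  matches <- gregexpr(pattern, x, fixed = TRUE)
  # gregexpr returns -1 for "no match", so only positive positions are counted
  sum(vapply(matches, function(pos) sum(pos > 0), integer(1)))
}

# Number of sentences containing the trigger at least once
n_sentences <- length(grep("sample text", sentences2, fixed = TRUE))

# Total number of occurrences
n_occurrences <- count_occurrences("sample text", sentences2)

# Apply the same helper to a vector of triggers
triggers <- c("sample text", "no match")
counts <- vapply(triggers, count_occurrences, integer(1), x = sentences2)

print(n_sentences)
print(n_occurrences)
print(counts)
```

With fixed = TRUE the trigger is matched literally, so characters such as "." or "?" in a trigger are not interpreted as regular expression syntax.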
You could also sum the result of grepl:
sum(grepl(trigger_bg_1, df$sentences))
[1] 2
If you are really interested in bigrams rather than just set word combinations, the package quanteda can offer a more substantial and systematic way forward:
Data:
sentences <- c("sample text in sentence 1", "sample text in sentence 2")
id <- c(1,2)
df <- data.frame(sentences, id)
Solution:
library(quanteda)
# strip sentences down to words (removing punctuation):
words <- tokens(sentences, remove_punct = TRUE)
# make bigrams, tabulate them and sort them in decreasing order:
bigrams <- sort(table(unlist(as.character(tokens_ngrams(words, n = 2, concatenator = " ")))), decreasing = TRUE)
Result:
bigrams
in sentence sample text     text in  sentence 1  sentence 2
          2           2           2           1           1
If you want to inspect the frequency count of a specific bigram:
bigrams["in sentence"]
in sentence
          2
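If installing quanteda is not an option, the same tabulation can be approximated in base R. This is a rough sketch rather than a replacement: the helper make_bigrams is hypothetical, it splits on whitespace only, and unlike tokens() it does not remove punctuation.

```r
sentences <- c("sample text in sentence 1", "sample text in sentence 2")

# Build the bigrams of one sentence by pairing each word with its successor
make_bigrams <- function(sentence) {
  words <- strsplit(sentence, "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Tabulate the bigrams of all sentences and sort by frequency
bigram_counts <- sort(table(unlist(lapply(sentences, make_bigrams))), decreasing = TRUE)

print(bigram_counts)
print(bigram_counts[["sample text"]])
```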
I have a data frame in which a single row holds data like this:
df <- data.frame(text = c("in this line ???? another line and ???? one more", "more lines ???? another row"))
I want to separate each row into many rows, using ???? as the separator. Here is the expected output:
data.frame(text = c("in this line", "another line and", "one more", "more lines", "another row"))
Here is a base R solution
dfout <- data.frame(text = unlist(strsplit(as.character(df$text),split = " \\?{4} ")))
or, more efficiently (thanks to a comment by @Sotos):
dfout <- data.frame(text = unlist(strsplit(as.character(df$text),split = " ???? ", fixed = TRUE)))
such that
> dfout
              text
1     in this line
2 another line and
3         one more
4       more lines
5      another row
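If you also need to know which original row each piece came from, a small variation keeps a row index. This is a sketch only; the id column is an assumption added for illustration and is not part of the expected output in the question.

```r
df <- data.frame(
  text = c("in this line ???? another line and ???? one more",
           "more lines ???? another row"),
  stringsAsFactors = FALSE
)

# Split each row on the literal separator; fixed = TRUE avoids regex escaping
parts <- strsplit(df$text, " ???? ", fixed = TRUE)

# Repeat each source row number once per piece it produced
dfout <- data.frame(
  id = rep(seq_along(parts), lengths(parts)),
  text = unlist(parts),
  stringsAsFactors = FALSE
)

print(dfout)
```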
I need to extract the first 2 words from a string. If the string contains more than 2 words, it should return the first 2 words; if the string contains fewer than 2 words, it should return the string as it is.
I've tried using the 'word' function from the stringr package, but it doesn't give the desired output for strings with fewer than 2 words.
word(dt$var_containing_strings, 1,2, sep=" ")
Example:
Input String: Auto Loan (Personal)
Output: Auto Loan
Input String: Others
Output: Others
If you want to use stringr::word(), you can do:
ifelse(is.na(word(x, 1, 2)), x, word(x, 1, 2))
[1] "Auto Loan" "Others"
Sample data:
x <- c("Auto Loan (Personal)", "Others")
Something like this?
a <- "this is a character string"
unlist(strsplit(a, " "))[1:2]
[1] "this" "is"
EDIT:
To also return the original string when it has too few words, a simple if-else can be used:
a <- "this is a character string"
words <- unlist(strsplit(a, " "))
if (length(words) > 2) {
  # paste the first two words back into a single string
  paste(words[1:2], collapse = " ")
} else {
  a
}
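The if-else above handles one string at a time. A vectorised variant might look like the sketch below; the helper first_two_words is hypothetical, and it assumes the strings have no leading whitespace and contain no NA values.

```r
# Return the first two whitespace-separated words of each string;
# strings with fewer than two words come back unchanged
first_two_words <- function(x) {
  vapply(
    strsplit(x, "\\s+"),
    function(w) paste(head(w, 2), collapse = " "),
    character(1)
  )
}

x <- c("Auto Loan (Personal)", "Others")
result <- first_two_words(x)
print(result)
```

head(w, 2) returns at most two elements, so no explicit length check is needed.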
You could use regex in base R using sub
sub("(\\w+\\s+\\w+).*", "\\1", "Auto Loan (Personal)")
#[1] "Auto Loan"
This also works if the text contains only one word, because sub returns the input unchanged when the pattern does not match:
sub("(\\w+\\s+\\w+).*", "\\1", "Auto")
#[1] "Auto"
Explanation:
Here we capture the pattern shown inside the round brackets, (\\w+\\s+\\w+), which means:
\\w+ (one word) followed by \\s+ (whitespace) followed by \\w+ (another word), so in total we capture two words. The trailing .* matches the rest of the string, and the backreference \\1 in sub replaces the whole match with the two captured words.
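Since sub is vectorised, the same call can be applied to a whole column. A short sketch (the third input string is made up for illustration) that also uses grepl to show which inputs matched the two-word pattern at all:

```r
x <- c("Auto Loan (Personal)", "Others", "one two three four")

# Keep the first two words where the pattern matches; leave the rest untouched
first_two <- sub("(\\w+\\s+\\w+).*", "\\1", x)

# TRUE where the string contains at least two words separated by whitespace
has_two_words <- grepl("\\w+\\s+\\w+", x)

print(first_two)
print(has_two_words)
```

Note that \\w matches only letters, digits and underscores, so a "word" containing other characters, such as a hyphen, may need a different character class.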
My question is very similar to the question below, with the added problem that I need to split by a double-space.
Split column at delimiter in data frame
I would like to split this vector into columns.
text <- "first  second and second  third and third and third  fourth"
The result should be four columns reading "first", "second and second", "third and third and third", "fourth"
We can use \\s{2,} in strsplit to match runs of two or more whitespace characters:
v1 <- strsplit(text, "\\s{2,}")[[1]]
v1
#[1] "first" "second and second"
#[3] "third and third and third" "fourth"
This can be converted to data.frame using as.data.frame.list
setNames(as.data.frame.list(v1), paste0("col", 1:4))
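A variant of the same idea that does not hard-code the number of columns, as a sketch. It assumes the pieces are separated by exactly the double spaces described in the question, and the col1, col2, ... names are illustrative.

```r
text <- "first  second and second  third and third and third  fourth"

# Split on runs of two or more whitespace characters
parts <- strsplit(text, "\\s{2,}")[[1]]

# Turn each piece into its own column, then name the columns by position
wide <- setNames(
  as.data.frame(as.list(parts), stringsAsFactors = FALSE),
  paste0("col", seq_along(parts))
)

print(wide)
```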
I want to merge two columns of my data set. The two columns are either/or, i.e. if a value is present in one column, it won't be present in the other.
I tried this:
temp<-list(a=1:3,b=10:14)
paste(temp$a,temp$b)
output
"1 10" "2 11" "3 12" "1 13" "2 14"
and this
temp<-list(a=1:3,b=10:14,c=20:25)
temp<-within(temp,a <- paste(a, b, sep=''))
output
temp$a
[1] "110" "211" "312" "113" "214"
But what I am looking for is to fill in the values that are not present with an empty string. For example, temp$a only has 1:3 and temp$b has 10:14, i.e. two extra values, so I want my answer to be
1_10 2_11 3_12 _13 _14
EDIT: please note that I do not want column c to be concatenated with temp$a and temp$b.
Using stri_list2matrix, we can fill the list elements that have shorter length with '' and use paste.
library(stringi)
do.call(paste, c(as.data.frame(stri_list2matrix(temp, fill='')), sep='_'))
#[1] "1_10" "2_11" "3_12" "_13" "_14"
stri_list2matrix(temp, fill='') converts the list to a matrix after padding the shorter list elements with ''. We then convert it to a data.frame (as.data.frame) and use do.call(paste, ...) to paste the elements in each row, separated by _ (sep='_').
Update
Based on the edited 'temp', if you are interested only in the first two elements of 'temp'
do.call(paste, c(as.data.frame(stri_list2matrix(temp[1:2], fill='')),
sep='_'))
#[1] "1_10" "2_11" "3_12" "_13" "_14"
You can also subset by name, i.e. temp[c('a', 'b')]
Expand the length of the shorter vector to match the length of the longer vector, then paste:
paste(c(temp$a,rep("",length(temp$b)-length(temp$a))), temp$b, sep="_")
#[1] "1_10" "2_11" "3_12" "_13" "_14"
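The padding approach can be generalised so that it does not matter which vector is shorter. A base R sketch under that assumption; the helper pad_to and the intermediate names cols and padded are hypothetical:

```r
temp <- list(a = 1:3, b = 10:14, c = 20:25)

# Pad a vector with empty strings up to length n
pad_to <- function(v, n) c(as.character(v), rep("", n - length(v)))

# Only a and b are merged; c is deliberately left out
cols <- temp[c("a", "b")]
n <- max(lengths(cols))
padded <- lapply(cols, pad_to, n = n)

# Paste the padded vectors element by element, separated by "_"
result <- do.call(paste, c(padded, sep = "_"))
print(result)
```

Padding to the maximum length first avoids the silent recycling seen in the question, where paste reused 1 and 2 from temp$a for the last two elements.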