R sorting by most commonly occurring - r

This is probably very simple, but I have a vector of phrases, some of which repeat and some of which don't, and I would like a list of the unique phrases, sorted by how often they occur (most common first).
e.g.
vec <- c("hello","hi","hi","greetings","good day", "hi", "hello", "good day","good morning","hello","good day")
sort(unique(vec))
[1] "good day" "good morning" "greetings" "hello" "hi"
I would expect "hi" to come first, followed by "hello", then "good day", and so on.

Just use sort(table(vec)):
sort(table(vec), decreasing=TRUE)
# vec
#     good day        hello           hi good morning    greetings
#            3            3            3            1            1
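The question asks for the unique phrases themselves rather than a table of counts. A minimal sketch, assuming the names of the sorted table are what is wanted (the variable names counts and phrases are only illustrative):

```r
vec <- c("hello", "hi", "hi", "greetings", "good day", "hi", "hello",
         "good day", "good morning", "hello", "good day")

# Count each phrase, then order the counts from most to least frequent.
counts <- sort(table(vec), decreasing = TRUE)

# The names of the sorted table are the unique phrases in frequency order.
phrases <- names(counts)
print(phrases)
```

Note that in this example "good day", "hello" and "hi" each occur three times, so they are tied for first place and their relative order carries no meaning.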

Related

Why isn't str_count working with multiple strings?

I have a string with text like this:
Text <- c("How are you","What is your name","Hi my name is","You ate your cake")
And I want an output that counts the number of times the word "you" or "your" appears:
Text NumYou
"How are you" 1
"What is your name" 1
"Hi my name is" 0
"You ate your cake" 2
I tried using the str_count function, but it missed occurrences of "you" and "your":
NumYou = str_count(Text, c("you","your"))
Why isn't str_count working correctly?
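Pass the pattern as one string. str_count is vectorised over both the string and the pattern, so c("you","your") is recycled and compared element by element with Text instead of every pattern being tried on every string.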
stringr::str_count(tolower(Text),'you|your')
#[1] 1 1 0 2
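A pattern such as 'you|your' also counts "you" inside longer words (for example "young"). If only whole words should count, which is an assumption about the asker's intent, a base R sketch with word boundaries could look like this:

```r
Text <- c("How are you", "What is your name", "Hi my name is", "You ate your cake")

# Match "you" or "your" as whole words, ignoring case.
matches <- gregexpr("\\byour?\\b", Text, ignore.case = TRUE, perl = TRUE)

# regmatches() returns one character vector per string; its length is the count.
NumYou <- lengths(regmatches(Text, matches))

result <- data.frame(Text = Text, NumYou = NumYou)
print(result)
```

Because gregexpr takes a single pattern, there is no recycling of patterns against strings to worry about here.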

Finding all terms near a given phrase

Is there a way to find all words that are associated with a given phrase?
For example, say I want to find all of the words next to the word "illness" in a string. The string has the word "illness" quite a bit, and I want to find all of the terms surrounding it, such as "has illness," "does not have illness," "might have illness," etc...
If you just have a vector of strings (isolated or as a column in a data.frame), then perhaps:
s <- c("hello world illness quux bar not action illness",
"not an illness", "no positive", "illness", "illness goodbye")
ret <- lapply(list(gregexpr("(\\S+)(?= illness)", s, perl = TRUE),
gregexpr("(?<=illness )(\\S+)", s, perl = TRUE)),
regmatches, x = s)
Map(c, ret[[1]], ret[[2]])
# [[1]]
# [1] "world" "action" "quux"
# [[2]]
# [1] "an"
# [[3]]
# character(0)
# [[4]]
# character(0)
# [[5]]
# [1] "goodbye"
Each gregexpr finds a word (well, contiguous non-whitespace) followed by the literal " illness" or preceded by the literal "illness ". Because it returns a list with enough information to extract the substrings from the original, we use regmatches(x=s, ...) to extract the components.
(The Map command deals with the fact that the lapply result is a list, length 2, one for each regex. It merely concatenates the substrings from the first regex with the substrings from the second regex, "zipping" the matches for each string within the vector. If you look at ret[[1]] and/or ret as a whole, the benefit of this might make more sense.)
If you don't care in which string in the vector/column the surrounding words are found, then you can simply unlist this:
unlist(ret)
# [1] "world" "action" "an" "quux" "goodbye"
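The same idea can be wrapped in a small function so that the search term is not hard-coded. This is only a sketch: the name neighbours is hypothetical, and it assumes the term contains no regex metacharacters and is separated from its neighbours by single spaces.

```r
# Hypothetical helper: collect the words directly before and after `term`.
neighbours <- function(s, term) {
  before <- gregexpr(sprintf("\\S+(?= %s)", term), s, perl = TRUE)
  after  <- gregexpr(sprintf("(?<=%s )\\S+", term), s, perl = TRUE)
  # Combine, per input string, the words found before and after the term.
  Map(c, regmatches(s, before), regmatches(s, after))
}

s <- c("hello world illness quux bar not action illness",
       "not an illness", "no positive", "illness", "illness goodbye")

res <- neighbours(s, "illness")

# Tabulate how often each neighbouring word occurs across all strings.
print(table(unlist(res)))
```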
It is quite unclear exactly which pattern you want to match, so here are some options, using str_extract from the stringr package and positive lookahead:
library(stringr)
dt <- c("I mean he's got to have some illness I suppose that",
"whether you've had a serious illness or [unclear]",
"No not what illnesses, terminal illness.",
"Oh, Terminal illness, oh, sorry.",
"Illness benefit on a joint life, last survivor plan?")
str_extract(dt, ".*(?=.(i|I)llness)") # any string prior to "illness"
[1] "I mean he's got to have some"    "whether you've had a serious"    "No not what illnesses, terminal"
[4] "Oh, Terminal"                    NA
str_extract(dt, "(some|serious|terminal)(?=.(i|I)llness)") # specific words prior to "illness"
[1] "some" "serious" "terminal" NA NA
str_extract(dt, "\\w+\\b(?=.(i|I)llness)") # last word prior to "illness"
[1] "some" "serious" "what" "Terminal" NA
Another option is the kwic function from quanteda. You can specify the number of words surrounding the term you are looking for. In the example I'm using 3, but that can be anything. The advantage is that you can store the outcome in a data.frame and then investigate further.
Using the example from @Chris Ruehlemann:
library(quanteda)
dt <- c("I mean he's got to have some illness I suppose that",
"whether you've had a serious illness or [unclear]",
"No not what illnesses, terminal illness.",
"Oh, Terminal illness, oh, sorry.",
"Illness benefit on a joint life, last survivor plan?")
# recent quanteda versions expect a tokens object rather than a character vector
out <- kwic(tokens(dt), pattern = "illness", window = 3, valuetype = "regex")
out
[text1, 8]         to have some |  illness  | I suppose that
[text2, 6]        had a serious |  illness  | or [ unclear
[text3, 4]          No not what | illnesses | , terminal illness
[text3, 7] illnesses , terminal |  illness  | .
[text4, 4]        Oh , Terminal |  illness  | , oh ,
[text5, 1]                      |  Illness  | benefit on a
data.frame(out)
  docname from to                  pre   keyword               post pattern
1   text1    8  8         to have some   illness     I suppose that illness
2   text2    6  6        had a serious   illness       or [ unclear illness
3   text3    4  4          No not what illnesses , terminal illness illness
4   text3    7  7 illnesses , terminal   illness                  . illness
5   text4    4  4        Oh , Terminal   illness             , oh , illness
6   text5    1  1                        Illness       benefit on a illness

R - put a space before each word that begins with a capital letter, for a full column

I have a column imported from an XLSX file into R, where each row holds a sentence without spaces, but every word begins with a capital letter. I tried to use
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
but this only works if I convert each row one at a time.
Example
1 HowDoYouWorkOnThis
2 ThisIsGreatExample
3 ProgrammingIsGood
Expected is
1 How Do You Work On This
2 This Is Great Example
3 Programming Is Good
Is this what you're after?
s <- c("HowDoYouWorkOnThis", "ThisIsGreatExample", "ProgrammingIsGood")
sapply(s, function(x) trimws(gsub("([A-Z])", " \\1", x)))
# HowDoYouWorkOnThis ThisIsGreatExample ProgrammingIsGood
#"How Do You Work On This" "This Is Great Example" "Programming Is Good"
Or using stringr::str_replace_all:
library(stringr)
trimws(str_replace_all(s, "([A-Z])", " \\1"))
#[1] "How Do You Work On This" "This Is Great Example"
#[3] "Programming Is Good"
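Since the question is about a whole column, here is a sketch that applies one vectorised gsub call to a data frame column. The data frame df and its column names are made up for illustration. The pattern inserts a space only between a lowercase letter and the uppercase letter that follows it, so no leading space is produced and trimws is not needed; it assumes every word is longer than one letter.

```r
# Hypothetical data frame standing in for the imported XLSX column.
df <- data.frame(
  id = 1:3,
  sentence = c("HowDoYouWorkOnThis", "ThisIsGreatExample", "ProgrammingIsGood"),
  stringsAsFactors = FALSE
)

# gsub() is vectorised, so the whole column is processed in one call.
# Each lowercase/uppercase pair is rewritten with a space between the two letters.
df$spaced <- gsub("([a-z])([A-Z])", "\\1 \\2", df$sentence)

print(df$spaced)
```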

Extract words only with R

I have strings like these:
x <-c("DATE TODAY d. 011 + e. 0030 + r. 1061","Now or never d. 003 + e. 011 + g. 021", "Long term is long time (e. 104 to d. 10110)","Time is everything (1012) - /1072, 091A/")
Desired output:
d <- c("DATE TODAY","Now or never","Long term is long time","Time is everything")
After an hour with SO search, I just could not do it. Any help is appreciated.
This bit uses stringr to extract anything containing two or more alphabeticals:
> library(stringr)
> unlist(lapply(str_extract_all(x,"[a-zA-Z][a-zA-Z]+"),paste,collapse=" "))
[1] "DATE TODAY" "Now or never"
[3] "Long term is long time to" "Time is everything"
I'm hoping the "to" missing from your desired output is a mistake on your part. It's a perfectly good word, and you said you wanted to extract words.
The pattern is not very clear. But, based on the example shown, here are a couple of ways to get the expected result.
sub('( .\\.| \\().*', '', x)
#[1] "DATE TODAY" "Now or never" "Long term is long time"
#[4] "Time is everything"
or
pat1 <- '(?<=[0-9] )[A-Za-z]+(*SKIP)(*F)|[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY" "Now or never" "Long term is long time"
#[4] "Time is everything"
If "to" is a valid word and the expected result had a typo:
pat1 <- '[A-Za-z]{2,}'
sapply(regmatches(x,gregexpr(pat1, x, perl=TRUE)), paste, collapse=" ")
#[1] "DATE TODAY" "Now or never"
#[3] "Long term is long time to" "Time is everything"
I agree with the others that "to" is a valid word. Here's a stringi approach:
library(stringi)
stri_replace_all_regex(x, "\\s?[A-Za-z]?[+[:punct:]0-9]", "")
# [1] "DATE TODAY" "Now or never"
# [3] "Long term is long time to" "Time is everything"
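The answers above differ mainly in whether they keep only the leading text or every word found anywhere in the string. A short base R sketch that puts the two interpretations side by side (the names leading and words are only illustrative):

```r
x <- c("DATE TODAY d. 011 + e. 0030 + r. 1061",
       "Now or never d. 003 + e. 011 + g. 021",
       "Long term is long time (e. 104 to d. 10110)",
       "Time is everything (1012) - /1072, 091A/")

# Interpretation 1: keep the leading text, cutting at the first " <char>." or " (".
leading <- sub("( .\\.| \\().*", "", x)

# Interpretation 2: keep every run of two or more letters, wherever it occurs.
words <- sapply(regmatches(x, gregexpr("[A-Za-z]{2,}", x)), paste, collapse = " ")

# Show the inputs on which the two interpretations disagree.
print(x[leading != words])
```

Only the third string differs, because "to" appears after the cut point.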

How to get value when a variable name is passed as a string

I wrote this code in R:
paste("a","b","c")
which returns the value "abc"
The variable abc has a value of 5 (say). How do I get "abc" to give me the value 5? Is there any function like as.value(paste("a","b","c")) that will give me the answer 5? I have simplified my question, but this is exactly what I want. Please help. Thanks in advance.
paste("a","b","c") gives "a b c", not "abc". Use paste0("a","b","c") or sep = "" to get "abc".
Anyway, I think you are looking for get():
> abc <- 5
> get("abc")
[1] 5
An addition to Sacha's answer. If you want to assign a value to an object "abc" using paste():
assign(paste("a", "b", "c", sep = ""), 5)
This is certainly one of the most-asked questions about the R language, along with its evil twin brother "How do I turn x='myfunc' into an executable function?"
In summary, get, parse, eval, and expression are all good things to learn about. The most useful (IMHO) and least well known is do.call, which takes care of a lot of the string-to-object conversion work for you.
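As a rough illustration of two of the functions just mentioned, the sketch below looks up a variable by a name built at run time and calls a function given by name. The variable names var_name, value and total are only for illustration.

```r
abc <- 5

# Build the variable name as a string; paste0() joins without separators.
var_name <- paste0("a", "b", "c")

# get() looks up the object that has this name.
value <- get(var_name)
print(value)

# do.call() accepts a function name given as a string plus a list of arguments.
total <- do.call("sum", list(1, 2, 3))
print(total)
```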
Here is an example to demonstrate eval() and get(eval())
a <- 1
b <- 2
var_list <- c('a','b')
for (var in var_list) {
  # eval() on a character string returns it unchanged, so get(var) alone would also work
  print(paste(eval(var), ' : ', get(eval(var))))
}
This gives:
[1] "a  :  1"
[1] "b  :  2"
Here is a purrr example that does this for multiple variables:
library(purrr)  # provides map() and re-exports %>%
text1 = "Somewhere over the rainbow"
text2 = "All I want for Christmas is you"
text3 = "All too well"
text4 = "Save your tears"
text5 = "Meet me at our spot"
songs = (map(paste0("text", 1:5), get) %>% unlist)
songs
This gives
[1] "Somewhere over the rainbow"
[2] "All I want for Christmas is you"
[3] "All too well"
[4] "Save your tears"
[5] "Meet me at our spot"
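If purrr is not available, base R's mget() looks up several objects by name in one call. A minimal sketch using three of the variables above:

```r
text1 <- "Somewhere over the rainbow"
text2 <- "All I want for Christmas is you"
text3 <- "All too well"

# mget() returns a named list with one entry per requested variable name.
found <- mget(paste0("text", 1:3))

# Flatten the list into a plain character vector and drop the names.
songs <- unlist(found, use.names = FALSE)
print(songs)
```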
