I am having problems replacing parts of a single string with a set of vector replacements, to produce a vector.
I have a string tex which is intended to tell a diagram what text to put in the node (and other) labels.
So if tex is "!label has frequency of !frequency"
and T has columns label with values c("chickens","ducks",...) and frequency with values c(35,12,...) amongst others,
the function returns a vector like c("chickens has frequency of 35","ducks has frequency of 12",...)
More formally, the problem is:
Given a tibble T and a string tex,
return a vector with length nrow(T), of which each element = tex but in which each occurrence within tex of the pattern !pattern is replaced by the vectorised contents of T$pattern
I looked at
Replace string in R with patterns and replacements both vectors
and
String replacement with multiple strings, but they don't fit my use case.
stringr::str_replace() doesn't do it either.
A possible base R solution uses sprintf():
animals = c("chickens","ducks")
frequency = c(35,12)
sprintf( "%s has frequency of %s", animals, frequency)
[1] "chickens has frequency of 35" "ducks has frequency of 12"
also,
tex = "%s has frequency of %s"
sprintf( tex, animals, frequency )
will give the same result.
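For the general !pattern form of the question, a small helper could loop over the columns of T and replace each "!col" token with the corresponding column, vectorised over rows. A sketch (the name fill_template is made up):

```r
library(stringr)

# Hypothetical helper: replace each "!col" token in `tex` with T[[col]]
fill_template <- function(tex, T) {
  out <- rep(tex, nrow(T))
  # Replace longer column names first so e.g. "!frequency" is not
  # partially consumed by a hypothetical column named "freq"
  for (col in names(T)[order(-nchar(names(T)))]) {
    out <- str_replace_all(out, fixed(paste0("!", col)), as.character(T[[col]]))
  }
  out
}

T <- data.frame(label = c("chickens", "ducks"), frequency = c(35, 12))
fill_template("!label has frequency of !frequency", T)
# [1] "chickens has frequency of 35" "ducks has frequency of 12"
```

str_replace_all is vectorised over string and replacement, so each row's value lands in the matching element of the output.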
Related
I have a dataframe (data) with a column containing text from reports (data$Report_Text). I need to extract 40 characters before and after a keyword (including the keyword) for each row and store as a new column in the dataframe.
So far I have this for the characters before (ideally would like to store the text before + after in one column, but if that isn't possible I can do two columns):
data$characters <- sub('.*?(\\d{40}) keyword', "", data$Report_Text)
However when I run this, it gives me all of the text before the keyword, not just 40 characters. Where am I going wrong?
data$characters <- gsub("^.*(.{40}keyword.{40}).*$", "\\1", data$Report_Text)
possibly changing the . before each {40} to \\d (digits only), or to the character class of your preference.
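Alternatively, a sketch using stringr::str_extract, which returns NA when the keyword is absent and, via {0,40}, tolerates rows with fewer than 40 surrounding characters:

```r
library(stringr)

reports <- c("blah blah keyword and some trailing text",
             "no match in this row")
# {0,40} allows fewer than 40 characters at the edges;
# str_extract gives NA when "keyword" is absent
str_extract(reports, ".{0,40}keyword.{0,40}")
# [1] "blah blah keyword and some trailing text" NA
```

Note the gsub approach above returns the input unchanged when there is no match, whereas str_extract makes the no-match case explicit.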
I want to start using sprintf to build the title of a figure in R from two values. Can anyone show me how to do it correctly? The values of HS and score should be printed as characters after the terms in quotes.
title = sprintf ("HS %s", as.character(HS), "Score %s", as.character(score))
With sprintf, we can pass multiple arguments, as the usage is
sprintf(fmt, ...)
That implies there is a single fmt and any number of inputs:
sprintf("HS %s Score %s", as.character(HS), as.character(score))
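For instance, with some assumed example values (sprintf coerces numeric inputs for %s, so the as.character() calls are optional):

```r
HS <- 12     # assumed example value
score <- 90  # assumed example value
title <- sprintf("HS %s Score %s", HS, score)
title
# [1] "HS 12 Score 90"
```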
I have a character vector that stores different sentences. Some sentences include an alphanumeric string of a fixed length (15 characters), but some don't. Such string(s) of interest might vary in their combination of alphanumeric characters, but regardless will always be comprised of:
upper case letters
lower case letters
digits
no special characters
and will always be length of 15.
In some vector elements, there will be leading and trailing blank space before/after the string of interest. However, in other elements the string might show immediately at the beginning, or otherwise at the end. The most complex situation is when the string of interest shows without any spaces before/after, which means that it's embedded within another string.
I want to take such a vector and manipulate to return a new vector of the same length, but:
in elements that contained a string of interest, return only the string of interest.
in elements that did not contain a string of interest, return NA.
Example
set.seed(2020)
library(stringi)
library(stringr)
vector_of_strings <- stri_rand_strings(n = 100, length = 15, pattern = "[A-Za-z0-9]")
my_sentences <- c(
  str_interp("my sentece contains ${sample(vector_of_strings, size = 1)}"),
  str_interp("${sample(vector_of_strings, size = 1)} is in my sentence"),
  str_interp("sometimes - ${sample(vector_of_strings, size = 1)} - it shows like this"),
  str_interp("other times it could be${sample(vector_of_strings, size = 1)}without any space before or after"),
  "occasionally there's no string of interest, so such element should become 'NA'"
)
my_sentences
## [1] "my sentece contains 8OarR1YUGPBoRfi"
## [2] "WoV8ym3WB2zg2TD is in my sentence"
## [3] "sometimes - pmMk73q0L73qKUa - it shows like this"
## [4] "other times it could be1qvzWei5FxPtRGXwithout any space before or after"
## [5] "occasionally there's no string of interest, so such element should become 'NA'"
How can I take my_sentences and have it return the following?
[1] "8OarR1YUGPBoRfi"
[2] "WoV8ym3WB2zg2TD"
[3] "pmMk73q0L73qKUa"
[4] "1qvzWei5FxPtRGX"
[5] NA
EDIT
Based on ekoam's comment, I wonder whether the following mechanism could be utilized.
(1) First step, test whether any of the strings in vector_of_strings exists in my_sentences's elements. If yes, return the string of interest.
(2) Else, if no match, test whether any combination of alphanumeric and 15-character length exists. If there's a definitive single match, return the matched string.
(3).a. Else, if there's more than one possible match, return all possible strings.
(3).b. Else, if there's no match whatsoever, return NA.
For the sake of this example, let's assume that step (1) above is based on matching against sample(vector_of_strings, size = 50). This leaves some room for no match (so we can move forward to step (2)).
And just to make it clearer as of the desired output, I'm trying to get it all in a single vector that is of the same length as the original my_sentences, and the output(s) of the "mechanism" described above are at the respective element positions of the original vector.
This is not straightforward to address. All the conditions you mention are easy to operationalize but one: that the string of interest can be embedded within a larger string. What you could do is extract all words that have the right combination of character types, but allow their length to go beyond 15:
library(stringr)
x <- str_extract(my_sentences, "\\b[A-Za-z0-9]{15,}\\b")
x
[1] "heLvIQabKdDTrBC" "KpxeqHQ0Z94X6vG" "UNMcDuUDzPsRU7s" "beZccQAS3rCFZ5UO7without"
[5] NA
This way you would at least be sure not to have overlooked the embedded target strings. If the number of such strings is not too large you could then in a second step isolate the larger-than-15-char strings and remove the unwanted bits:
x[which(nchar(x) > 15)]
[1] "beZccQAS3rCFZ5UO7without"
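As a sketch of the EDIT's mechanism (extract_code is a made-up name): match against the known strings first, since fixed literals find even embedded targets, and fall back to the 15-character pattern otherwise. The fallback can still mis-anchor inside an embedded run, which is why the known-strings pass comes first.

```r
library(stringr)

extract_code <- function(sentences, known_strings) {
  # Step 1: look for any of the known strings
  # (assumes they contain no regex metacharacters, as here: alphanumeric only)
  step1 <- str_extract(sentences, paste(known_strings, collapse = "|"))
  # Step 2: fall back to any standalone 15-character alphanumeric run
  step2 <- str_extract(sentences, "\\b[A-Za-z0-9]{15}\\b")
  ifelse(is.na(step1), step2, step1)
}

sentences <- c("my sentence contains 8OarR1YUGPBoRfi",
               "it could be1qvzWei5FxPtRGXwithout spaces",
               "no code in this one")
extract_code(sentences, c("8OarR1YUGPBoRfi", "1qvzWei5FxPtRGX"))
# [1] "8OarR1YUGPBoRfi" "1qvzWei5FxPtRGX" NA
```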
I have a vector of strings like this:
"1111111221111122111111UUUUUUUUUUUUUUUUUU"
"---1-1---1--111111"
"1111112111 1111" (with blank spaces)
Each one has a different length, and I want to extract the maximum value of each string; for the three examples above the max values would be (2, 1, 2). But I don't know how to handle the letters, the dashes or the blank spaces: all three should count as the minimum, i.e. 1 is bigger than "U", "-" and " ", and those three rank equally.
Any advice?
Decompose the problem into independent, solvable steps:
Transform the input into a suitable format
Find the maximum
Then we get:
# Separate strings into individual characters
digits_str = strsplit(input, '')
# Convert to correct type
digits = lapply(digits_str, as.integer)
# Perform actual logic, on each input string in turn.
result = vapply(digits, max, integer(1L), na.rm = TRUE)
This uses the lapply and vapply functions which allow you to perform an operation (here first as.integer and then max) on all values in a vector/list.
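Applied to the three example strings from the question (as.integer emits a warning for each non-digit character, which is harmless here and can be silenced with suppressWarnings()):

```r
input <- c("1111111221111122111111UUUUUUUUUUUUUUUUUU",
           "---1-1---1--111111",
           "1111112111 1111")
digits_str <- strsplit(input, "")
# Letters, dashes and blanks become NA, so they rank below every digit
digits <- suppressWarnings(lapply(digits_str, as.integer))
vapply(digits, max, integer(1L), na.rm = TRUE)
# [1] 2 1 2
```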
I need to write a function that finds the most common word in a string of text, where I define a "word" as any whitespace-separated sequence of characters.
It should return the most common word.
For general purposes, it is better to use boundary("word") in stringr:
library(stringr)
most_common_word <- function(s){
  words <- unlist(str_split(s, boundary("word")))
  which.max(table(words))
}
sentence <- "This is a very short sentence. It has only a few words: a, a. a"
most_common_word(sentence)
Here is a function I designed. Notice that I split the string on white space, removed any leading or trailing white space, removed ".", and converted all upper case to lower case. Finally, if there is a tie, I always report the first word. These are assumptions you should think about for your own analysis.
# Create example string
string <- "This is a very short sentence. It has only a few words."
library(stringr)
most_common_word <- function(string){
  string1 <- str_split(string, pattern = " ")[[1]]     # Split the string
  string2 <- str_trim(string1)                         # Remove white space
  string3 <- str_replace_all(string2, fixed("."), "")  # Remove dots
  string4 <- tolower(string3)                          # Convert to lower case
  word_count <- table(string4)                         # Count the words
  return(names(word_count[which.max(word_count)][1]))  # Report the most common word
}
most_common_word(string)
[1] "a"
Hope this helps:
most_common_word = function(x){
  # Split the string into single words for counting
  # (unlist is needed so that table() receives an atomic vector)
  splitTest = unlist(strsplit(x, " "))
  # Count the words
  count = table(splitTest)
  # Sort to select only the highest value, which is the first one
  count = count[order(count, decreasing = TRUE)][1]
  # Return the desired character.
  # By changing this you can choose whether it shows the number of times a word repeats
  return(names(count))
}
You can use return(count) to show the word plus the number of times it repeats. This function has problems when two words are repeated the same number of times, so beware of that.
The order function puts the highest value first (when used with decreasing = TRUE); ties are then broken by name, sorted alphabetically. So if the words 'a' and 'b' are repeated the same number of times, only 'a' gets returned by most_common_word.
Using the tidytext package, taking advantage of established parsing functions:
library(tidytext)
library(dplyr)
word_count <- function(test_sentence) {
  unnest_tokens(data.frame(sentence = test_sentence,
                           stringsAsFactors = FALSE), word, sentence) %>%
    count(word, sort = TRUE)
}
word_count("This is a very short sentence. It has only a few words.")
This gives you a table with all the word counts. You can adapt the function to obtain just the top one, but be aware that there will sometimes be ties for first, so perhaps it should be flexible enough to extract multiple winners.
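If ties should all be reported, a base-R sketch (most_common_words is a made-up name) could keep every word whose count equals the maximum:

```r
most_common_words <- function(s) {
  # Lower-case, then split on anything that is not a letter, digit or apostrophe
  words <- unlist(strsplit(tolower(s), "[^a-z0-9']+"))
  words <- words[nzchar(words)]  # drop empty fragments
  tab <- table(words)
  names(tab[tab == max(tab)])  # all words tied for first place
}

most_common_words("This is a very short sentence. It has only a few words: a, a. a")
# [1] "a"
```

With a tied input such as "b a b a c", this returns c("a", "b") rather than silently dropping one of the winners.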