Vectorizing A Custom Function - r

I wrote this function to return the string before a certain character, which goes like so:
strBefore <- function(find, x, last = FALSE, occurrence) {
# Checking.
if (class(x)[1] != "character") { stop("The strBefore function only supports objects of character class.") }
# Getting the place of the find, and handling both cases of last.
fullPlace <- gregexpr(find, x)[[1]] # Gets the location of the occurrences of find in x.
# Handling the case where last is TRUE.
if (last == TRUE) { place <- max(fullPlace) # Grabbing the latest character index if last is TRUE.
} else { place <- min(fullPlace) } # Otherwise, getting the first space.
# Handles the occurrence argument if given.
if (!missing(occurrence)) { place <- fullPlace[occurrence] }
# Subsetting the string.
xlen <- nchar(x) # Getting the total number of characters in the string.
x <- substr(x, 1, place - 1) # Minus 1 because we don't want to include the first hit for find.
return(x)
}
Where find is the character you want the string before, x is the character object to search, last asks whether to cut before the last occurrence of find, and occurrence designates which occurrence of find to cut before (overrides last if given).
If I use it on a single character object, it works fine like so:
> test <- "Hello World"
> test2 <- strBefore(" ", test)
> test2
[1] "Hello"
However, if I use it on a character vector, it cuts each item in the vector at the same place as the first item:
> test <- c("Hello World", "Hi There", "Why Hello")
> test2 <- strBefore(" ", test)
> test2
[1] "Hello" "Hi Th" "Why H"
Now, this link here does provide me with a method for doing what I want:
Using gsub to extract character string before white space in R
However, I do like having the functionality of the "occurrence" argument, which returns the string before the 2nd, 3rd, etc... occurrence of the find argument.
Just as a note, I can vectorize my function with sapply like so:
> test <- c("Hello World", "Hi There", "Why Hello")
> test2 <- sapply(test, function(x) strBefore(" ", x))
> test2
Hello World Hi There Why Hello
"Hello" "Hi" "Why"
Which somewhat solves my problem...but is there a way to do this more cleanly without having to use an apply function? I'm not looking for a solution to what strBefore does, but more a solution to how to vectorize custom functions. Thanks for your time.
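One option base R offers for the general question is Vectorize(), which wraps a single-value function so that chosen arguments are looped over with mapply() internally. It is still an apply-style loop underneath, only hidden behind the function call. A minimal sketch, assuming a simplified single-string version of strBefore (the names strBeforeOne and strBeforeVec are made up for this illustration, and occurrence defaults to NULL here instead of relying on missing()):

```r
# Hypothetical single-string helper, simplified from strBefore above.
strBeforeOne <- function(find, x, last = FALSE, occurrence = NULL) {
  hits <- gregexpr(find, x)[[1]]                 # positions of find in ONE string
  place <- if (last) max(hits) else min(hits)    # last or first hit
  if (!is.null(occurrence)) place <- hits[occurrence]
  substr(x, 1, place - 1)
}

# Vectorize() returns a wrapper that calls mapply() over the x argument only.
strBeforeVec <- Vectorize(strBeforeOne, vectorize.args = "x", USE.NAMES = FALSE)

test <- c("Hello World", "Hi There", "Why Hello")
strBeforeVec(" ", test)
#> [1] "Hello" "Hi"    "Why"

# regexpr() and substr() are already vectorized, so the first-occurrence
# case needs no wrapper at all.
substr(test, 1, regexpr(" ", test) - 1)
#> [1] "Hello" "Hi"    "Why"
```

The underlying cause of the original behaviour is the [[1]] after gregexpr(), which keeps only the match positions for the first element of x. Where a vectorized building block such as regexpr() exists, using it directly is usually cleaner than wrapping a scalar function.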

Related

How do I write a function to count the characters in a string?

I'm working a practice exercise for a class, and I've reached an impasse. The instructions state:
Write a function that takes a string of text and counts the number of characters. The function should return "There are xx characters in that string."
This is what I have thus far:
w <- "I hope everyone has a good weekend"
answer <- function (nchar) {
statement <- paste("There are", nchar, "characters in that string")
}
I've tried plugging "w" into the function to see if it works, but I'm getting no results. Please bear in mind that I'm new to R.
But I've been wracking my brain over this. Can someone give me a clue as to what I'm missing? Many thanks for any help provided.
nchar is your function to count the number of characters in a string. If you don't want to count the whitespace you could use gsub to remove them from your string and count again the characters. You could use the following code:
w <- "I hope everyone has a good weekend"
answer <- function (x) {
statement <- paste("There are", nchar(x), "characters in that string")
statement
}
answer(w)
#> [1] "There are 34 characters in that string"
answer2 <- function (x) {
statement <- paste("There are", nchar(gsub(" ", "",x))
, "characters in that string")
statement
}
answer2(w)
#> [1] "There are 28 characters in that string"
Created on 2023-02-03 with reprex v2.0.2
You are confusing the function nchar() with your function input. Look at the following:
w <- "I hope everyone has a good weekend"
answer <- function (myInputString) { statement <- paste("There are",
nchar(myInputString), "characters in that string")
return(statement) }
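Note that your function also ends with an assignment, so its value is returned invisibly and nothing is printed. Add return() (or the bare variable name) at the end of your function to specify what the output should be.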
Good luck with your journey into coding ;)
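To see why the original function appeared to give "no results", it may help to check visibility directly. The sketch below uses base R's withVisible(); the function names f_assign and f_value are hypothetical, chosen only for this illustration:

```r
# Last expression is an assignment: the value is returned, but invisibly.
f_assign <- function(x) {
  statement <- paste("There are", nchar(x), "characters in that string")
}

# Last expression is the variable itself: the value is returned visibly.
f_value <- function(x) {
  statement <- paste("There are", nchar(x), "characters in that string")
  statement
}

res <- withVisible(f_assign("abc"))
res$visible
#> [1] FALSE
res$value
#> [1] "There are 3 characters in that string"

withVisible(f_value("abc"))$visible
#> [1] TRUE
```

So the assignment-only version does compute a value; it just does not auto-print at the console unless the call is wrapped in print() or parentheses.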
Just for a bit of fun - and for you to try to work out what is going on - here are some alternative functions that give the same answer as the built-in nchar but don't actually use it...
This one splits it into a list of single characters, converts it to a vector, and returns the length...
library(stringr) # str_split() comes from the stringr package
nchar1 <- function(s) length(unlist(str_split(s, "")))
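This one converts it into raw format (a vector of the byte values that are used to encode the string) and returns the length. It counts bytes, so it matches nchar only when every character is a single byte (plain ASCII, for example)...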
nchar2 <- function(s) length(charToRaw(s))
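As a small illustration of the bytes-versus-characters caveat for the charToRaw approach, here is a sketch with one non-ASCII character. The string is an assumption picked for the example, and the \u escape is used so that it is stored as UTF-8:

```r
s <- "caf\u00e9"              # "café"; é is stored as two bytes in UTF-8

nchar(s)                      # counts characters
#> [1] 4
length(charToRaw(s))          # counts bytes
#> [1] 5
nchar(s, type = "bytes")      # nchar can also count bytes when asked
#> [1] 5
```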
This one uses a while loop to see at which point the substring function substr returns an empty string...
nchar3 <- function(s){
i <- 0
while(substr(s, i+1, i+2) != ""){
i <- i+1
}
return(i)
}
This one uses a similar approach to count how many times we can remove the first character before getting to an empty string...
nchar4 <- function(s){
i <- 0
while(s != ""){
s <- sub(".", "", s)
i <- i + 1
}
return(i)
}
This one might make your head hurt a bit. It uses a similar technique to the last one but uses Recall to call itself until it gets to the point (a blank string) at which it returns an answer.
nchar5 <- function(s, n = 0){
if(s == "") {
return(n)
} else {
Recall(sub(".", "", s), n + 1)
}
}
nchar1("Good luck!")
[1] 10
nchar2("Good luck!")
[1] 10
nchar3("Good luck!")
[1] 10
nchar4("Good luck!")
[1] 10
nchar5("Good luck!")
[1] 10

Re-prompt readline() if the input is invalid

Let's say I want to ask the user for an input, a number over 10. If the input is not valid, print a message and re-prompt. How can this be achieved in R?
I understand that this could be solved with an if or while statement, but I can't wrap my head around it.
Example
math <- function(number_1) {
number_1 <- readline("Enter your number: ")
# if the number is below 10, I want to re-prompt with readline(...)
result <- number_1 / 2
return(result)
}
Here's a way:
math <- function() {
result <- NA
while (is.na(result) || result < 10) {
text <- readline("Enter your number: ")
result <- as.numeric(text)
}
result
}
You don't need to give any input to your function; it will get the input when it prompts the user. The is.na(result) check handles NA: result starts out as NA, so the loop runs at least once, and if the user enters something that isn't a number, as.numeric() returns NA (with a warning) and the user is prompted again.
Since readline() returns a character value, you need as.numeric to convert it to a number.
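The question also asked for a message before re-prompting. A minimal sketch of that variant follows, assuming "over 10" means strictly greater than 10. The function name prompt_number and its read and min_value parameters are hypothetical, and the reader is passed in as an argument only so the loop can be tried without an interactive session:

```r
# Keeps asking until the input parses as a number greater than min_value.
prompt_number <- function(read = readline, min_value = 10) {
  repeat {
    text <- read("Enter your number: ")
    value <- suppressWarnings(as.numeric(text))   # NA if text is not a number
    if (!is.na(value) && value > min_value) {
      return(value)
    }
    message("Please enter a number over ", min_value, ".")
  }
}

# Non-interactive demonstration: a fake reader that replays canned answers.
answers <- c("abc", "7", "42")
i <- 0
fake_read <- function(prompt) {
  i <<- i + 1
  answers[i]
}

result <- prompt_number(read = fake_read)
result
#> [1] 42
```

In an interactive session, calling prompt_number() with no arguments falls back to readline().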

Are there text processing functions that operate on word level in R?

I am trying to find a group of functions in R that operate at the word level, e.g. a function that could return the position of a word. For example, given the following sentence and query:
sentence <- "A sample sentence for demo"
query <- "for"
the function would return 4, because "for" is the 4th word.
It would be great if I could get a utility function that would allow me to extend query both in left and right direction.
e.g. extend(query, 'right') would return for demo and extend(query, 'left') would return sentence for
I have already gone through functions like grep, gregexpr, word from the stringr package and others. All seem to operate at the character level.
If you use scan, it will split input at whitespace:
> s.scan <- scan(text=sentence, what="")
Read 5 items
> which(s.scan == query)
[1] 4
You need what="" to tell scan to expect character rather than numeric input. If your input is ever full English sentences, you might need to strip punctuation first using gsub with pattern="[[:punct:]]". You may also want to look at the tm (text mining) package if you are trying to classify parts of speech or handle large documents.
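The punctuation step mentioned above could look roughly like this. It is a sketch only: the sentence is an invented example, and quiet = TRUE is added just to suppress the "Read n items" message:

```r
sentence <- "A sample sentence, for demo!"
query <- "for"

# Strip punctuation first so that "sentence," becomes "sentence".
clean <- gsub("[[:punct:]]", "", sentence)

# scan() splits on whitespace; what = "" asks for character output.
words <- scan(text = clean, what = "", quiet = TRUE)
words
#> [1] "A"        "sample"   "sentence" "for"      "demo"

which(words == query)
#> [1] 4
```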
As I mentioned in my comment, stringr is useful in these instances.
library(stringr)
sentence <- "A sample sentence for demo"
wordNumber <- 4L
fourthWord <- word(string = sentence,
start = wordNumber)
previousWords <- word(string = sentence,
start = wordNumber - 1L,
end = wordNumber)
laterWords <- word(string = sentence,
start = wordNumber,
end = wordNumber + 1L)
And this yields:
> fourthWord
[1] "for"
> previousWords
[1] "sentence for"
> laterWords
[1] "for demo"
I hope that helps you.
I have written my own functions. The indexOf function returns the index of the word if it is found in the sentence and -1 otherwise, very much like Java's indexOf():
indexOf <- function(sentence, word){
listOfWords <- strsplit(sentence, split = " ")
sentenceAsVector <- unlist(listOfWords)
if(word %in% sentenceAsVector == FALSE){
result=-1
}
else{
result = which(sentenceAsVector==word)
}
return(result)
}
The extend function works properly, but it is quite lengthy and doesn't look like R code at all. If query is a word on the boundary of the sentence, i.e. the first word or the last word, the first two words or the last two words are returned:
extend <- function(sentence, query, direction){
listOfWords = strsplit(sentence, split = " ")
sentenceAsVector = unlist(listOfWords)
lengthOfSentence = length(sentenceAsVector)
location = indexOf(sentence, query)
boundary = FALSE
if(location == 1 | location == lengthOfSentence){
boundary = TRUE
}
else{
boundary = FALSE
}
if(!boundary){
if(location> 1 & direction == "right"){
return(paste(sentenceAsVector[location],
sentenceAsVector[location + 1],
sep=" ")
)
}
else if(location < lengthOfSentence & direction == "left"){
return(paste(sentenceAsVector[location - 1],
sentenceAsVector[location],
sep=" ")
)
}
}
else{
if(location == 1 ){
return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
}
if(location == lengthOfSentence){
return(paste(sentenceAsVector[lengthOfSentence - 1],
sentenceAsVector[lengthOfSentence], sep = " "))
}
}
}
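For comparison, the same behaviour can probably be expressed more compactly by computing a two-word index window and shifting it back inside the sentence at the boundaries. This is a sketch under the same assumptions as the function above (single spaces between words, first match only); the name extend_sketch is hypothetical, and unlike the original it returns NA when the query is absent:

```r
extend_sketch <- function(sentence, query, direction = c("right", "left")) {
  direction <- match.arg(direction)
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  loc <- match(query, words)                 # first position of query, or NA
  if (is.na(loc)) return(NA_character_)

  # Two-word window: query plus its right or left neighbour.
  idx <- if (direction == "right") c(loc, loc + 1L) else c(loc - 1L, loc)

  # At a sentence boundary, slide the window back inside the sentence.
  if (idx[1] < 1L) idx <- idx + 1L
  if (idx[2] > length(words)) idx <- idx - 1L

  paste(words[idx], collapse = " ")
}

sentence <- "A sample sentence for demo"
extend_sketch(sentence, "for", "right")
#> [1] "for demo"
extend_sketch(sentence, "for", "left")
#> [1] "sentence for"
extend_sketch(sentence, "A", "left")
#> [1] "A sample"
```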
The answer depends on what you mean by a "word". If you mean whitespace-separated token, then @imran-ali's answer works fine. If you mean word as defined by Unicode, with special attention to punctuation, then you need something more sophisticated.
The following handles punctuation correctly:
library(corpus)
sentence <- "A sample sentence for demo"
query <- "for"
# use text_locate to find all instances of the query, with context
text_locate(sentence, query)
##   text before            instance after
## 1 1    A sample sentence for      demo
# find the number of tokens before, then add 1 to get the position
text_ntoken(text_locate(sentence, query)$before) + 1
## 4
This also works if there are multiple matches:
sentence2 <- "for one, for two! for three? for four"
text_ntoken(text_locate(sentence2, query)$before) + 1
## [1] 1 4 7 10
We can verify that this is correct:
text_tokens(sentence2)[[1]][c(1, 4, 7, 10)]
## [1] "for" "for" "for" "for"

Exchange the order of only the middle letters of a word and keep the order of the punctuation mark if there is one

I require a function that exchanges only the characters in the middle of a word (between the first and last character). The function should not change anything if the word is composed of fewer than 4 characters. The function has an argument called exclude, which is a vector of potential punctuation marks at the end of a word.
You can assume that there is only one possible punctuation mark. If there is such a mark, it should not be changed and not counted towards the number of characters the actual word is composed of.
My idea is to:
1. exclude any punctuation mark
2. take out the first letter of the word
3. take out the last letter of the word
4. exchange the remaining letters
5. add back the first and last letters and punctuation marks
string <-c("well!")
interchange.middle.of.word <- function(string, exclude = c(",",".","!","?")){
result1 <- strsplit(string, split= "")
result2 <- unlist(result1)
result3 <- result2[ - which(result2 %in% exclude)]
result4 <- head(result3,-1)
result5 <- tail(result4,-1)
result6 <- sample(result5,replace = FALSE)
result7 <- c(result3[1],result6)
result8 <- c(result7, result4[length(result4)])
result9 <- c(result8, result2[which(result2 %in% exclude)])
result10 <- paste(result9,collapse="")
return(result10)
}
The code worked for the example 'well!', but not for another example like 'hello?' or for words with fewer than 4 letters like 'Tes':
for 'hello?' my result was 'hlell?'
for 'Tes' my result was 'NA'
I would be grateful if anyone can tell me where I went wrong about the code.
Many thanks.
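As for where the code above goes wrong, two lines look like the likely culprits. First, result2[ - which(result2 %in% exclude)] returns nothing when there is no punctuation, because negating an empty index still gives an empty index. Second, result8 appends result4[length(result4)], but result4 has already lost the final letter, so the second-to-last letter is added instead. A short sketch of both effects (variable names here are for illustration only):

```r
exclude <- c(",", ".", "!", "?")
chars <- strsplit("Tes", split = "")[[1]]

# which() finds no punctuation, so the index is -integer(0), i.e. integer(0),
# and indexing with integer(0) selects nothing at all.
dropped <- chars[-which(chars %in% exclude)]
length(dropped)
#> [1] 0

# A logical index keeps the letters whether or not punctuation is present.
kept <- chars[!chars %in% exclude]
kept
#> [1] "T" "e" "s"

# Second issue: the last element after head(x, -1) is not the word's last letter.
letters_only <- strsplit("hello", split = "")[[1]]
without_last <- head(letters_only, -1)
without_last[length(without_last)]
#> [1] "l"
letters_only[length(letters_only)]
#> [1] "o"
```

This is consistent with the 'NA' result for 'Tes' and with 'hello?' ending in 'l?' rather than 'o?'.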
Assuming you want to reverse the order of the middle characters:
revmiddle <- function(s,exclude=c(',','.','!','?')) {
if (nchar(s)<4L) return(s);
x <- strsplit(s,'')[[1L]];
if (x[length(x)]%in%exclude) {
punc <- x[length(x)];
x <- x[-length(x)];
} else {
punc <- NULL;
}; ## end if
paste(collapse='',c(x[1L],x[seq(length(x)-1L,2L)],x[length(x)],punc));
}; ## end revmiddle()
Demo:
revmiddle('well!');
## [1] "wlel!"
revmiddle('hello?');
## [1] "hlleo?"
revmiddle('Tes');
## [1] "Tes"
To randomize the order of the middle characters:
randmiddle <- function(s,exclude=c(',','.','!','?')) {
if (nchar(s)<4L) return(s);
x <- strsplit(s,'')[[1L]];
if (x[length(x)]%in%exclude) {
punc <- x[length(x)];
x <- x[-length(x)];
} else {
punc <- NULL;
}; ## end if
paste(collapse='',c(x[1L],sample(x[-c(1L,length(x))]),x[length(x)],punc));
}; ## end randmiddle()
Demo (intentionally performing many executions and collecting sorted unique results, which effectively demonstrates all possible outcomes):
sort(unique(replicate(1e3L,randmiddle('well!'))));
## [1] "well!" "wlel!"
sort(unique(replicate(1e3L,randmiddle('hello?'))));
## [1] "hello?" "hlelo?" "hlleo?"
sort(unique(replicate(1e3L,randmiddle('Tes'))));
## [1] "Tes"

Finding the position of a character within a string

I am trying to find the equivalent of the ANYALPHA SAS function in R. This function searches a character string for an alphabetic character, and returns the first position at which the character is found.
Example: looking at the following string '123456789A', the ANYALPHA function would return 10 since first alphabetic character is at position 10 in the string. I would like to replicate this function in R but have not been able to figure it out. I need to search for any alphabetic character regardless of case (i.e. [:alpha:])
Thanks for any help you can offer!
Here's an anyalpha function. I added a few extra features. You can specify the maximum number of matches you want in the n argument; it defaults to 1. You can also specify whether you want the position or the value itself with value=TRUE:
anyalpha <- function(txt, n=1, value=FALSE) {
txt <- as.character(txt)
indx <- gregexpr("[[:alpha:]]", txt)[[1]]
ret <- indx[1:(min(n, length(indx)))]
if(value) {
mapply(function(x,y) substr(txt, x, y), ret, ret)
} else {ret}
}
#test
x <- '123A56789BC'
anyalpha(x)
#[1] 4
anyalpha(x, 2)
#[1] 4 10
anyalpha(x, 2, value=TRUE)
#[1] "A" "B"
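Because anyalpha takes only the first element of the gregexpr() result, it handles one string at a time. If only the first position is needed for a whole vector of strings, regexpr() is already vectorized. A minimal sketch, assuming (as I understand the SAS behaviour) that 0 should be returned when no alphabetic character is found; the name anyalpha_vec is hypothetical:

```r
anyalpha_vec <- function(txt) {
  # regexpr() returns the first match position per element, or -1 for no match.
  pos <- as.integer(regexpr("[[:alpha:]]", as.character(txt)))
  pos[which(pos == -1L)] <- 0L    # assumption: mimic ANYALPHA's 0 for "not found"
  pos
}

anyalpha_vec(c("123456789A", "abc", "1234"))
#> [1] 10  1  0
```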

Resources