Capitalizing letters. R equivalent of Excel "PROPER" function [duplicate] - r

This question already has answers here:
Capitalize the first letter of both words in a two word string
(15 answers)
Closed 6 years ago.
Colleagues,
I'm looking at a data frame resembling the extract below:
Month Provider Items
January CofCom 25
july CofCom 331
march vobix 12
May vobix 0
I would like to capitalise the first letter of each word and lower-case the remaining letters of each word. This would result in the data frame resembling the one below:
Month Provider Items
January Cofcom 25
July Cofcom 331
March Vobix 12
May Vobix 0
In a word, I'm looking for R's equivalent of the PROPER function available in MS Excel.

With regular expressions:
x <- c('woRd Word', 'Word', 'word words')
gsub("(?<=\\b)([a-z])", "\\U\\1", tolower(x), perl=TRUE)
# [1] "Word Word" "Word" "Word Words"
(?<=\\b)([a-z]) says look for a lowercase letter preceded by a word boundary (e.g., a space or the beginning of a line). (?<=...) is called a "look-behind" assertion. \\U\\1 says replace that character with its uppercase version. \\1 is a back reference to the first group surrounded by () in the pattern. See ?regex for more details.
If you only want to capitalize the first letter of the first word, use the pattern "^([a-z])" instead.
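For example, a quick check of that first-word-only variant (sub() is enough here, since ^ can only match at the start of the string):
x <- c('woRd Word', 'word words')
sub("^([a-z])", "\\U\\1", tolower(x), perl = TRUE)
# [1] "Word word"  "Word words"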

The question is about an equivalent of Excel PROPER and the (former) accepted answer is based on:
proper=function(x) paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
It might be worth noting that:
proper("hello world")
## [1] "Hello world"
Excel PROPER would give, instead, "Hello World". For a 1:1 mapping with Excel, see @Matthew Plourde's answer.
If what you actually need is to set only the first character of a string to upper-case, you might also consider the shorter and slightly faster version:
proper <- function(s) sub("(.)", "\\U\\1", tolower(s), perl = TRUE)
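For example, applied to a small vector (only the first character of each string ends up upper-cased):
proper(c("hello world", "HELLO WORLD"))
# [1] "Hello world" "Hello world"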

Another method uses the stringi package. The stri_trans_general function appears to lower case all letters other than the initial letter.
require(stringi)
x <- c('woRd Word', 'Word', 'word words')
stri_trans_general(x, id = "Title")
[1] "Word Word" "Word" "Word Words"

I don't think there is one built in, but you can easily write it yourself:
(dat <- data.frame(x = c('hello', 'frIENds'),
y = c('rawr','rulZ'),
z = c(16, 18)))
# x y z
# 1 hello rawr 16
# 2 frIENds rulZ 18
proper <- function(x)
paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
(dat <- data.frame(lapply(dat, function(x)
if (is.numeric(x)) x else proper(x)),
stringsAsFactors = FALSE))
# x y z
# 1 Hello Rawr 16
# 2 Friends Rulz 18
str(dat)
# 'data.frame': 2 obs. of 3 variables:
# $ x: chr "Hello" "Friends"
# $ y: chr "Rawr" "Rulz"
# $ z: num 16 18

Related

plot function type

For digits I have done so:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use the [:punct:] class to detect punctuation. It detects
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either with gregexpr:
x = c("we are friends!, Good Friends!!")
gregexpr("[[:punct:]]", x)
R> gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice that the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear:
counts = stringi::stri_count_fixed(x, punct)
Decorate the vector
setNames(counts, punct)
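Putting those steps together as one runnable sketch (x is the example string from above, and the punctuation set is the one quoted from @Joba):
library(stringi)
x <- "we are friends!, Good Friends!!"
punct  <- strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
counts <- setNames(stri_count_fixed(x, punct), punct)
counts[counts > 0]
# ! ,
# 3 1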
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression might help too.

How to get the first 10 words in a string in R?

I have a string in R as
x <- "The length of the word is going to be of nice use to me"
I want the first 10 words of the above specified string.
Also, for example, I have a CSV file where the format looks like this:
Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston
I want to get only the first 10 words from the column 'Keyword' for each row and write them to a CSV file.
Please help me in this regard.
Regular expression (regex) answer using \w (word character) and its negation \W:
gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
^ Beginning of the token (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by not-words.
(\\w+\\W+){9} A word followed by not-a-word, 9 times
\\w+ One or more word characters (i.e. a word)
\\W+ One or more non-word characters (i.e. a space)
{9} Nine repetitions
\\w+ The tenth word
.* Anything else, including other following words
$ End of the token (zero-width)
\\1 When this token is found, replace it with the first captured group (the 10 words)
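Applied to the example string from the question, this keeps just the first ten words (a quick check of the pattern above):
x <- "The length of the word is going to be of nice use to me"
gsub("^((\\w+\\W+){9}\\w+).*$", "\\1", x)
# [1] "The length of the word is going to be of"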
How about using the word function from Hadley Wickham's stringr package?
word(string = x, start = 1, end = 10, sep = fixed(" "))
Here is a small function that splits the string, subsets the first ten words, and then pastes them back together.
string_fun <- function(x) {
ul = unlist(strsplit(x, split = "\\s+"))[1:10]
paste(ul,collapse=" ")
}
string_fun(x)
df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
df <- as.data.frame(df)
Using apply (the function isn't doing anything in the second column)
df$Keyword <- apply(df[,1:2], 1, string_fun)
EDIT
Probably this is a more general way to use the function.
df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))
print(df)
# Keyword City.Column.Header.
# 1 The length of the string should not be more than New York
# 2 The Keyword should be of specific length is or are Los Angeles
# 3 This is an experimental basis program string is or Seattle
# 4 Please help me with getting only the first ten Boston
x <- "The length of the word is going to be of nice use to me"
head(strsplit(x, split = " ")[[1]], 10)
# [1] "The" "length" "of" "the" "word" "is" "going" "to" "be" "of"

R - Convert a column of factors into binary characters without loss of information

I am very new to R and just teaching myself how to use it. I'm using R version 3.0.1 on Windows 7 (if that's relevant).
I am having trouble converting factor data into plain characters. My data is as follows:
activity <- c("1","2","10","ZZ")
What I want to have as output is
activity <- c("01","02","10","ZZ")
where each string that contains only one character should be prefixed by a 0 (as shown above).
I tried using "as.character", but that doesn't add a leading zero. Then I found sprintf and tried:
activity <- sprintf("%02d", (activity))
# [1] "01" "02" "03" "04"
This adds a zero "0" in front of any single-character value found, but what's troublesome is that it modifies all the levels of the data (as shown above).
Does anybody know what's wrong here and how I can fix it? Thank you.
You can use regular expressions, particularly the function sub to replace any single digit with a 0 followed by that digit. You should do this to replace the levels of your factor activity so that the whole data is changed accordingly:
levels(activity) <- sub("^([0-9])$", "0\\1", levels(activity))
# [1] 01 02 10 ZZ
# Levels: 01 02 10 ZZ
Edit: If you want to not just replace numbers but any string with just 1 character, then you can just replace [0-9] with .. That is:
# suppose x is:
x <- c("1", "a", "Y", "!", "bb", "45")
x <- factor(x, levels=unique(x))
levels(x) <- sub("^(.)$", "0\\1", levels(x))
# [1] 01 0a 0Y 0! bb 45
# Levels: 01 0a 0Y 0! bb 45
Read ?factor for the proper way to convert factors back to their values. You need to be cautious about manipulating factors as you've seen since sometimes you will wind up altering the underlying index rather than the level of the factor.
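A quick illustration of that caution, as a minimal sketch on a small factor like the one used below:
y <- factor(c('1', '2', '10', 'ZZ'))
as.integer(y)    # the underlying codes (levels sort to "1","10","2","ZZ"), not the labels
# [1] 1 3 2 4
as.character(y)  # the labels themselves
# [1] "1"  "2"  "10" "ZZ"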
Also, you cannot "zero pad" character values with sprintf's %02d:
y <- factor(c('1', '2', '10', 'ZZ'))
x <- as.character(y)
sprintf('%02d', x)
Error in sprintf("%02d", x) :
invalid format '%02d'; use format %s for character objects
Instead, you could use a yucky ifelse:
ifelse(is.na(as.numeric(x)), x, sprintf('%02d', as.numeric(x)))
[1] "01" "02" "10" "ZZ"
But as Arun has shown, regular expressions are the way to go here!

Extracting characters from entries in a vector in R

There are functions in Excel called left, right, and mid, where you can extract part of the entry from a cell. For example, =left(A1, 3) would return the 3 left-most characters in cell A1, and =mid(A1, 3, 4) would start at the third character in cell A1 and give you characters 3 through 6. Are there similar functions in R or similarly straightforward ways to do this?
As a simplified sample problem I would like to take a vector
sample <- c("TRIBAL", "TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")
and create 3 new vectors that contain the first 3 characters in each entry, the middle 2 characters in each entry, and the last 4 characters in each entry.
A slightly more complicated question that Excel doesn't have a function for (that I know of) would be how to create a new vector with the 1st, 3rd, and 5th characters from each entry.
You are looking for the function substr or its close relative substring:
The leading characters are straightforward:
substr(sample, 1, 3)
[1] "TRI" "TRI" "RHO" "EUG" "BYR"
So is extracting some characters at a defined position:
substr(sample, 2, 3)
[1] "RI" "RI" "HO" "UG" "YR"
To get the trailing characters, you have two options:
substr(sample, nchar(sample)-3, nchar(sample))
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
substring(sample, nchar(sample)-3)
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
And your final "complicated" question:
characters <- function(x, pos){
sapply(x, function(x)
paste(sapply(pos, function(i)substr(x, i, i)), collapse=""))
}
characters(sample, c(1,3,5))
TRIBAL TRISTO RHOSTO EUGFRI BYRRAT
"TIA" "TIT" "ROT" "EGR" "BRA"

Count the number of all words in a string

Is there a function to count the number of words in a string?
For example:
str1 <- "How many words are in this sentence"
to return a result of 7.
Use the regular expression symbol \\W to match non-word characters, using + to indicate one or more in a row, along with gregexpr to find all matches in a string. Words are the number of word separators plus 1.
lengths(gregexpr("\\W+", str1)) + 1
This will fail with blank strings at the beginning or end of the character vector, when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, \\S+, [[:alpha:]], etc., but there will always be edge cases with a regex approach), etc. It is likely more efficient than strsplit solutions, which will allocate memory for each word. Regular expressions are described in ?regex.
Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:
str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3
Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think my answer's caveat about the 'notion of one word' in the original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might use:
sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3
Again the regular expression might be refined for different notions of 'word'.
I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966's, but with a regular expression to delimit words) that makes use of the original notion of delimiting words is:
lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3
This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.
The simplest way would be:
require(stringr)
str_count("one, two three 4,,,, 5 6", "\\S+")
... counting all sequences of non-space characters (\\S+).
But what about a little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo = FALSE){
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6
I use the str_count function from the stringr library with the escape sequence \w that represents:
any ‘word’ character (letter, digit or underscore in the current
locale: in UTF-8 mode only ASCII letters and digits are considered)
Example:
> str_count("How many words are in this sentence", '\\w+')
[1] 7
Of the other 9 answers that I was able to test, only two (by Vincent Zoonekynd and by petermeissner) worked for all the inputs presented here so far, but they also require stringr.
However, only this solution works with all the inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".
Benchmark:
library(stringr)
questions <-
c(
"", "x", "x y", "x y!", "x y! z",
"foo+bar+baz~spam+eggs",
"one, two three 4,,,, 5 6",
"How many words are in this sentence",
"How many words are in this sentence",
"Combien de mots sont dans cette phrase ?",
"
Day after day, day after day,
We stuck, nor breath nor motion;
"
)
answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)
score <- function(f) sum(unlist(lapply(questions, f)) == answers)
funs <-
c(
function(s) sapply(gregexpr("\\W+", s), length) + 1,
function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
function(s) length(str_match_all(s, "\\S+")[[1]]),
function(s) str_count(s, "\\S+"),
function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
function(s) length(unlist(strsplit(s," "))),
function(s) sapply(strsplit(s, " "), length),
function(s) str_count(s, '\\w+')
)
unlist(lapply(funs, score))
Output (11 is the maximum possible score):
6 10 10 8 9 9 7 6 6 11
You can use strsplit and sapply functions
sapply(strsplit(str1, " "), length)
str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])
The gsub(' {2,}',' ',str1) makes sure all words are separated by one space only, by replacing all occurrences of two or more spaces with one space.
The strsplit(str,' ') splits the sentence at every space and returns the result in a list. The [[1]] grabs the vector of words out of that list. The length counts up how many words.
> str1 <- "How many words are in this sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How" "many" "words" "are" "in" "this" "sentence"
> strsplit(str2,' ')[[1]]
[1] "How" "many" "words" "are" "in" "this" "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
You can use str_match_all, with a regular expression that would identify your words.
The following works with initial, final and duplicated spaces.
library(stringr)
s <- "
Day after day, day after day,
We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" ) # Sequences of non-spaces
length(m[[1]])
Try this function from the stringi package:
require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+ "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+ "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+ "")
> stri_stats_latex(s)
CharsWord CharsCmdEnvir CharsWhite Words Cmds Envirs
133 0 30 24 0 0
Also from the stringi package, the straightforward function stri_count_words:
stringi::stri_count_words(str1)
#[1] 7
You can use the wc function from the qdap library:
> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7
You can remove double spaces and count the number of " " in the string to get the count of words. Use str_count from stringr and rm_white from qdapRegex:
str_count(rm_white(s), " ") + 1
Try this
length(unlist(strsplit(str1," ")))
require(stringr)
str_count(x, "\\w+")
This will be fine with double or triple spaces between words. Many of the other answers have issues with more than one space between words.
Solution 7 does not give the correct result when there's just one word.
You should not just count the elements in gregexpr's result (which is -1 if there are no matches) but count the elements > 0.
Ergo:
sapply(gregexpr("\\W+", str1), function(x) sum(x>0) ) + 1
require(stringr)
Define a very simple function
str_words <- function(sentence) {
str_count(sentence, " ") + 1
}
Check
str_words("This is a sentence with six words")
Use nchar. If the vector of strings is called x:
(nchar(x) - nchar(gsub(' ', '', x))) + 1
Find the number of spaces, then add one.
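As other answers note, repeated spaces inflate this count; a small sketch of the issue and one possible fix (squeezing runs of spaces first):
x <- c("How many words are in this sentence", "two  spaces")
(nchar(x) - nchar(gsub(' ', '', x))) + 1
# [1] 7 3    <- the doubled space adds a phantom word
x2 <- gsub(' {2,}', ' ', x)
(nchar(x2) - nchar(gsub(' ', '', x2))) + 1
# [1] 7 2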
You could use the stringr functions str_split() and boundary(), which will recognize the boundaries of words while ignoring punctuation and any extra spaces:
sapply(str_split("It's 12 o'clock already", boundary("word")), length)
#[1] 4
sapply(str_split(" It's >12 o'clock already ?! ", boundary("word")), length)
#[1] 4
With the stringr package, one can also write a simple script that traverses a vector of strings, for example with a for loop.
Let's say df$text contains the vector of strings that we are interested in analysing. First, we add additional columns to the existing data frame df as below:
df$strings = as.integer(NA)
df$characters = as.integer(NA)
Then we run a for-loop over the vector of strings as below:
for (i in 1:nrow(df))
{
df$strings[i] = str_count(df$text[i], '\\S+') # counts the strings
df$characters[i] = str_count(df$text[i]) # counts the characters & spaces
}
The resulting columns, strings and characters, will contain the counts of words and characters, and this is achieved in one go for the whole vector of strings.
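Since str_count() is vectorised over its first argument, the same two columns can also be filled without an explicit loop; a minimal sketch, assuming the same df$text column as above:
library(stringr)
df$strings    <- str_count(df$text, '\\S+')  # word (token) counts
df$characters <- str_count(df$text)          # character counts, including spaces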
I've found the following function and regex useful for word counts, especially in dealing with single vs. double hyphens, where the former generally should not count as a word break, e.g., well-known, hi-fi; whereas a double hyphen is a punctuation delimiter that is not bounded by white space--such as for parenthetical remarks.
txt <- "Don't you think e-mail is one word--and not two!" #10 words
words <- function(txt) {
length(attributes(gregexpr("(\\w|\\w\\-\\w|\\w\\'\\w)+",txt)[[1]])$match.length)
}
words(txt) #10 words
stringi is a useful package, but it over-counts words in this example because of the hyphen:
stringi::stri_count_words(txt) #11 words
There's a simple solution in Python (not R) using split and len:
text = 'This is a test for counting words'
# default separator: space
result = len(text.split())
print("There are " + str(result) + " words.")
You can get more details at
https://www.delftstack.com/howto/python/python-count-words-in-string/
