Center a string by padding spaces up to a specified length - R

I have a vector of names, like this:
x <- c("Marco", "John", "Jonathan")
I need to format it so that the names get centered in 10-character strings, by adding leading and trailing spaces:
> output
# [1] "  Marco   " "   John   " " Jonathan "
I was hoping for a solution less complicated than going through paste, rep, and counting nchar (maybe with sprintf, but I don't know how).

Here's a sprintf() solution that uses a simple helper vector f holding the left-side widths. We can then insert the widths into our format using the * character, taking the ceiling() on the right side to account for an odd number of padding spaces. Since our maximum field width is 10, any name longer than 10 characters remains unchanged, because pmax() clamps its width at zero.
f <- pmax((10 - nchar(x)) / 2, 0)
sprintf("%-*s%s%*s", f, "", x, ceiling(f), "")
# [1] " Marco " " John " " Jonathan " "Christopher"
Data:
x <- c("Marco", "John", "Jonathan", "Christopher")
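If you'd rather avoid the sprintf() width trick, the same floor-left/ceiling-right split can be written as a small base-R helper (the name center is mine, not a built-in):

```r
# Minimal base-R sketch; `center` is a hypothetical helper name
center <- function(s, width = 10, pad = " ") {
  extra <- pmax(width - nchar(s), 0)  # clamp so long names stay unchanged
  left  <- extra %/% 2                # floor of the padding on the left
  right <- extra - left               # ceiling of the padding on the right
  paste0(strrep(pad, left), s, strrep(pad, right))
}
center(c("Marco", "John", "Jonathan", "Christopher"))
# [1] "  Marco   " "   John   " " Jonathan " "Christopher"
```

Note that strrep() requires R >= 3.3.0.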

I know it's not the same language, but it's worth noting that Python (unlike R) has a built-in method for doing exactly this, str.center():
example = "John"
example.center(10)
# '   John   '
When the padding can't be split evenly, the extra space generally goes on the right, and it lets you choose the fill character. It's not vectorized, though.


Why does paste() concatenate list elements in the wrong order?

Given the following string:
my.str <- "I welcome you my precious dude"
One splits it:
my.splt.str <- strsplit(my.str, " ")
And then concatenates:
paste(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6], sep = " ")
The result is:
[1] "I you precious" "welcome my dude"
When not using the colon operator it returns the correct order:
paste(my.splt.str[[1]][1], my.splt.str[[1]][2], my.splt.str[[1]][3], my.splt.str[[1]][4], my.splt.str[[1]][5], my.splt.str[[1]][6], sep = " ")
[1] "I welcome you my precious dude"
Why is this happening?
paste is designed to work with vectors element-by-element. Say you did this:
names <- c('Alice', 'Bob', 'Charlie')
paste('Hello', names)
You'd want the result to be [1] "Hello Alice" "Hello Bob" "Hello Charlie", rather than "Hello Hello Hello Alice Bob Charlie".
To make it work like you want it to, rather than giving the different sections to paste as separate arguments, you could first combine them into a single vector with c:
paste(c(my.splt.str[[1]][1:2], my.splt.str[[1]][3:4], my.splt.str[[1]][5:6]), collapse = " ")
## [1] "I welcome you my precious dude"
We can use collapse instead of sep
paste(my.splt.str[[1]], collapse= ' ')
With the OP's first approach, paste is pasting the corresponding elements of each subset.
If we want to paste selectively, it helps to first create an object so the repeated [[ can be avoided:
v1 <- my.splt.str[[1]]
v1[3:4] <- toupper(v1[3:4])
paste(v1, collapse=" ")
#[1] "I welcome YOU MY precious dude"
When paste has multiple vector arguments, it pastes their corresponding elements together:
paste(v1[1:2], v1[3:4])
#[1] "I you" "welcome my"
If we use collapse, the result is a single string, but the order is still different, because the first element of v1[1:2] is pasted with the first element of v1[3:4], and the second with the second:
paste(v1[1:2], v1[3:4], collapse = ' ')
#[1] "I you welcome my"
It is documented in ?paste
paste converts its arguments (via as.character) to character strings, and concatenates them (separating them by the string given by sep). If the arguments are vectors, they are concatenated term-by-term to give a character vector result. Vector arguments are recycled as needed, with zero-length arguments being recycled to "".
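A quick illustration of the term-by-term concatenation and recycling described in that quote:

```r
paste(c("a", "b", "c"), 1:3)                   # term-by-term: "a 1" "b 2" "c 3"
paste(c("a", "b", "c"), 1:3, collapse = "; ")  # then collapsed: "a 1; b 2; c 3"
paste0("x", 1:3)                               # "x" is recycled: "x1" "x2" "x3"
```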
Also, converting part of the string to uppercase can be done on a substring, without splitting:
sub("^(\\w+\\s+\\w+)\\s+(\\w+\\s+\\w+)", "\\1 \\U\\2", my.str, perl = TRUE)
#[1] "I welcome YOU MY precious dude"

Edge Conditional White Space Issue R

I'm trying to clean a long character vector and am getting an edge case where separating the following format of text isn't possible:
$4.917.10%
The issue is how to insert a conditional white space so that the text looks like this: $4.91 7.10%.
The vector is called "test9", and the script that cleans the typical situations, where there is a "-" in front of the %, is:
gsub("(?=[-])", " ", test9, perl = TRUE)
The edge case is infrequent but is a feature of the vector that needs to be handled. There isn't a fixed number of digits to the left of the decimal point (whether expressing $ or %), but there are always two digits to the right of it, which makes me think conditionally matching on that is probably the way to go.
Here is a sample of a large piece of one element of the vector:
$28.00$25.0518.09%
Thanks!
Here's another option.
gsub("(?<=\\.\\d{2})(?!%)", " ", "$28.00$25.0518.09%", perl = TRUE)
# [1] "$28.00 $25.05 18.09%"
We have a positive lookbehind (?<=\\.\\d{2}) looking for a dot and two digits, and a negative lookahead (?!%) for %.
More generally, I guess you may also have "$28.00$25.0518.09%18.09%" in which case we need something else:
gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$])", " ", "$28.00$25.0518.09%18.09%", perl = TRUE)
# [1] "$28.00 $25.05 18.09% 18.09%"
Now we have either a positive lookbehind for a dot and two digits or a positive lookbehind for %, and a positive lookahead for a digit or a dollar sign.
If I understand correctly that your general problem is of the form "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05", then we may use almost the same solution as the latter one:
gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$-])", " ", "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05", perl = TRUE)
# [1] "$28.00 $25.05 18.09% -7.10% $25.05 -$25.05 $25.05"
One option is to do it in two stages. First insert a space after every pair of decimal digits. Then remove the unwanted space this inserts before a %:
x = '$28.00$25.0518.09%'
y = gsub('(\\.\\d{2})', '\\1 ', x, perl = TRUE) # insert a space after the two decimal digits
trimws(gsub('\\s%', '% ', y)) # move the space from before % to after %
# "$28.00 $25.05 18.09%"
This should also work for the more general cases @Julius described:
x = "$28.00$25.0518.09%18.09%" # "$28.00 $25.05 18.09% 18.09%"
x = "$28.00$25.0518.09%-7.10%$25.05-$25.05$25.05" # "$28.00 $25.05 18.09% -7.10% $25.05 -$25.05 $25.05"
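For reuse, the general lookaround pattern can be wrapped in a small helper (the name split_amounts is mine):

```r
# Sketch of a reusable wrapper around the same lookaround regex
split_amounts <- function(s) {
  gsub("((?<=\\.\\d{2})|(?<=%))(?=[\\d$-])", " ", s, perl = TRUE)
}
split_amounts("$4.917.10%")
# [1] "$4.91 7.10%"
split_amounts("$28.00$25.0518.09%")
# [1] "$28.00 $25.05 18.09%"
```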

How to count the number of segments in a string in R?

I have a string printed out like this:
"\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
(The "\" characters aren't really in the string; R just prints them to escape the quotes.)
I would like to calculate how many non-empty segments there are in this string. In this case the answer should be 11.
I tried converting it to a vector, but R ignores the quotation marks, so I still ended up with a vector of length 1.
I don't know whether I need to extract the segments first and then count them, or whether there's an easier way.
If it's the former, which regular-expression function best suits my need?
Thank you very much.
You can use scan to convert your large string into a vector of individual ones, then use nchar to count the lengths. Assuming your large string is x:
y <- scan(text=x, what="character", sep=",", strip.white=TRUE)
Read 12 items
sum(nchar(y)>0)
[1] 11
I assume a segment is defined as anything between . or ,. An option using strsplit can be found as:
length(grep("\\w+", trimws(strsplit(str, split=",|\\.")[[1]])))
#[1] 11
Note: trimws is not mandatory in the statement above. I have included it so that one can get the value of each segment just by adding the value = TRUE argument to grep.
Data:
str <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
strsplit might be one possibility?
txt <- "Jenna and Alex were making cupcakes., Jenna asked Alex whether all were ready to be frosted.,
Alex said that, some of them , were., He added, that, the rest, would be, ready, soon.,"
a <- strsplit(txt, split=",")
length(a[[1]])
[1] 11
If the backslashes are part of the text, it doesn't change much, except for the last element, which would contain "\"". By filtering that out, the result is the same:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all
were ready to be frosted.\", \"Alex said that\", \" some of them \",
\"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
a <- strsplit(txt, split=", \"")
length(a[[1]][a[[1]] != "\""])
[1] 11
This is an absurd idea, but it does work:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
Txt <-
read.csv(text = txt,
header = FALSE,
colClasses = "character",
na.strings = c("", " "))
sum(!vapply(Txt, is.na, logical(1)))

Regular expression not working in R but works on website. Text mining

I have a regex which works on a regular-expression testing website but doesn't work when I copy it into R. Below is the code to recreate my data frame:
text <- data.frame(page = c(1,1,2,3), sen = c(1,2,1,1),
text = c("Dear Mr case 1",
"the value of my property is £500,000.00 and it was built in 1980",
"The protected percentage is 0% for 2 years",
"The interest rate is fixed for 2 years at 4.8%"))
regex working on website: https://regex101.com/r/OcVN5r/2
Below is the R code I have tried so far; neither call works.
library(stringr)
patt = "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"
str_extract(text, patt)
grepl(pattern = patt, x = text)
I'm getting an error saying the regex is wrong, but it works on the website. I'm not sure how to get it to work in R.
Basically I am trying to extract pieces of information from the text. Below are the details:
From the above dataframe, I need to extract the following:
1: The gender of the person. In this case it would be Male (based on "Mr").
2: The property value; in this case it would be £500,000.00.
3: The protected percentage value, which in our case would be 0%.
4: The interest rate value, which in our case is 4.8%.
I think you can do this with the regexpr function.
For example:
text <- "Dear Mr case 1, the value of my property is £500,000.00 and it was built in 1980, The protected percentage is 13% for 2 years, The interest rate is fixed for 2 years at 4.8%"
grps <- regexpr(pattern = patt, text = text, perl = TRUE, ignore.case = TRUE)
start_idx <- attr(grps, "capture.start")
end_idx <- start_idx + attr(grps, "capture.length") - 1  # last character of each group
substring(text = text, first = start_idx, last = end_idx)
This matches: [1] "Mr" "£500,000.00" "13%" "4.8%"
From the manual:
regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match). The match positions and lengths are in characters unless useBytes = TRUE is used, when they are in bytes (as they are for an ASCII-only matching: in either case an attribute useBytes with value TRUE is set on the result). If named capture is used there are further attributes "capture.start", "capture.length" and "capture.names".
gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.
In your case, I think you need to paste the lines together first:
full_line <- paste(text[, "text"], collapse = " ")
Then apply regexpr on full_line.
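Alternatively, regexec() together with regmatches() extracts the capture groups in one step. This is a sketch assuming the patt and text from the question (regexec() accepts perl = TRUE since R 3.3.0):

```r
# Data frame and pattern as defined in the question
text <- data.frame(page = c(1, 1, 2, 3), sen = c(1, 2, 1, 1),
                   text = c("Dear Mr case 1",
                            "the value of my property is £500,000.00 and it was built in 1980",
                            "The protected percentage is 0% for 2 years",
                            "The interest rate is fixed for 2 years at 4.8%"))
patt <- "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"
full_line <- paste(text$text, collapse = " ")
m <- regexec(patt, full_line, perl = TRUE, ignore.case = TRUE)
regmatches(full_line, m)[[1]][-1]  # drop the full match, keep the four groups
# [1] "Mr"          "£500,000.00" "0%"          "4.8%"
```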
I think the issue is that your regex isn't set up with alternation ("OR" matches). See below, based on your bullet list:
library(stringi)
rgx <- "(?<=dear\\s?)(m(r(s)?|s|iss))|\\p{S}([0-9]\\S+)|([0-9]+)((\\.[0-9]{1,})?)\\%"
unlist(stri_extract_all_regex(
  text$text, rgx, opts_regex = stri_opts_regex(case_insensitive = TRUE)
))
Which gives
[1] "Mr" "£500,000.00" "0%" "4.8%"
The pattern says:
"(?<=dear\\s?)(m(r(s)?|s|iss))" = find a match where the word dear appears before a mr, ms, mrs or miss... but don't capture the dear or the leading space
| = OR
"\\p{S}([0-9]\\S+)" = find a match where a sequence of numbers occurs after a symbol (see ?stringi-search-charclass), until there is a white space; it must have a symbol at the beginning
| = OR
"([0-9]+)((\\.[0-9]{1,})?)\\%" = find a match where a number occurs one or more times, that may have a decimal with numbers after it, but will end in a percent sign

How to Convert "space" into "%20" with R

Referring to the title, I'm trying to figure out how to convert the spaces between words to %20.
For example,
> y <- "I Love You"
How to make y = I%20Love%20You
> y
[1] "I%20Love%20You"
Thanks a lot.
Another option would be URLencode():
y <- "I love you"
URLencode(y)
[1] "I%20love%20you"
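By default URLencode() leaves URL-reserved characters such as ? and = alone; pass reserved = TRUE to percent-encode those as well:

```r
u <- "my file.csv?x=a b"
URLencode(u)                   # spaces only: "my%20file.csv?x=a%20b"
URLencode(u, reserved = TRUE)  # also ? and =: "my%20file.csv%3Fx%3Da%20b"
```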
gsub() is one option:
R> gsub(pattern = " ", replacement = "%20", x = y)
[1] "I%20Love%20You"
The function curlEscape() from the package RCurl gets the job done.
library('RCurl')
y <- "I love you"
curlEscape(urls=y)
[1] "I%20love%20you"
I like URLencode(), but be aware that it sometimes doesn't work as expected if your URL already contains a %20 together with a real space, in which case not even the repeated = TRUE option of URLencode() does what you want.
In my case, I needed to run both URLencode() and gsub() consecutively to get exactly what I needed, like so:
a <- "already%20encoded%space/a real space.csv"
URLencode(a)
# returns: "already%20encoded%space/a real space.csv"
# note the spaces that are not transformed (the existing %20 makes URLencode return the input unchanged)
URLencode(a, repeated = TRUE)
# returns: "already%2520encoded%25space/a%20real%20space.csv"
# note the %2520 in the first part
gsub(" ", "%20", URLencode(a))
# returns: "already%20encoded%space/a%20real%20space.csv"
In this particular example, gsub() alone would have been enough, but URLencode() is of course doing more than just replacing spaces.
