I'm trying to create a function able to return various versions of the same string but with blank spaces between the letters.
something like:
input <- "word"
returning:
w ord
wo rd
wor d
We first break the string into every character using strsplit. We then append an empty space at every position using sapply.
input <- "word"
input_break <- strsplit(input, "")[[1]]
c(input, sapply(seq(1,nchar(input)-1), function(x)
paste0(append(input_break, " ", x), collapse = "")))
#[1] "word" "w ord" "wo rd" "wor d"
?append gives us append(x, values, after = length(x))
where x is the vector, value is the value to be inserted (here " " ) and after is after which place you want to insert the values.
Here is an option using sub
sapply(seq_len(nchar(input)-1), function(i) sub(paste0('^(.{', i, '})'), '\\1 ', input))
#[1] "w ord" "wo rd" "wor d"
Or with substring
paste(substring(input, 1, 1:3), substring(input, 2:4, 4))
#[1] "w ord" "wo rd" "wor d"
Related
I am trying to format UK postcodes that come in as a vector of different input in R.
For example, I have the following postcodes:
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
How do I write a generic code that would convert entries of the above vector into:
c("IV41 8PW","IV40 8BU","KY11 4HJ","KY1 1UU","KY4 9RW","G32 7EJ")
That is the first part of the postcode is separated from the second part of the postcode by one space and all letters are capitals.
EDIT: the second part of the postcode is always the 3 last characters (combination of a number followed by letters)
I couldn't come up with a smart regex solution so here is a split-apply-combine approach.
sapply(strsplit(sub('^(.*?)(...)$', '\\1:\\2', postcodes), ':', fixed = TRUE), function(x) {
paste0(toupper(trimws(x, whitespace = '[.\\s-]')), collapse = ' ')
})
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
The logic here is that we insert a : (or any character that is not in the data) in the string between the 1st and 2nd part. Split the string on :, remove unnecessary characters, get it in upper case and combine it in one string.
One approach:
Convert to uppercase
extract the alphanumeric characters
Paste back together with a space before the last three characters
The code would then be:
library(stringr)
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
postcodes <- str_to_upper(postcodes)
sapply(str_extract_all(postcodes, "[:alnum:]"), function(x)paste(paste0(head(x,-3), collapse = ""), paste0(tail(x,3), collapse = "")))
# [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
You can remove everything what is not a word caracter \\W (or [^[:alnum:]_]) and then insert a space before the last 3 characters with (.{3})$ and \\1.
sub("(.{3})$", " \\1", toupper(gsub("\\W+", "", postcodes)))
#sub("(...)$", " \\1", toupper(gsub("\\W+", "", postcodes))) #Alternative
#sub("(?=.{3}$)", " ", toupper(gsub("\\W+", "", postcodes)), perl=TRUE) #Alternative
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
# Option 1 using regex:
res1 <- gsub(
"(\\w+)(\\d[[:upper:]]\\w+$)",
"\\1 \\2",
gsub(
"\\W+",
" ",
postcodes
)
)
# Option 2 using substrings:
res2 <- vapply(
trimws(
gsub(
"\\W+",
" ",
postcodes
)
),
function(ir){
paste(
trimws(
substr(
ir,
1,
nchar(ir) -3
)
),
substr(
ir,
nchar(ir) -2,
nchar(ir)
)
)
},
character(1),
USE.NAMES = FALSE
)
I have a vector:
vector <- c("A", "B", "C")
And I want to print the following:
[1] A then B and C
[1] B then A and C
[1] C then A and B
I have been working with a for loop. However, I can't figure out how to print the sequence seperated by 'and'?
for(i in vector){
print(paste(i, "then", XXX))
}
I guess something needs to added where I wrote XXX?
You can use paste with collapse = " then " and reorder vector using [ in your for loop.
for(i in seq_along(vector)) {
print(paste0(vector[i], " then ", paste(vector[-i], collapse = " and ")))
}
#[1] "A then B and C"
#[1] "B then A and C"
#[1] "C then A and B"
You can use setdiff to find the remaining vector and then paste with collapse= to put that whole vector together into some text:
for(i in vector){
remaining.elements.vector <- setdiff(vector, i)
remaining.elements.text <- paste(remaining.elements.vector, collapse=' and ')
print(paste(i, "then", remaining.elements.text))
}
This seems like a simple question, but I can't find a solution. I want to take all of the objects (character vectors) in my environment and use them as arguments in a paste function. But the catch is I want to do so without specifying them all individually.
a <- "foo"
b <- "bar"
c <- "baz"
z <- paste(a, b, c, sep = " ")
z
[1] "foo bar baz"
I imagine that there must be something like the ls() would offer this, but obviously
z <- paste(ls(), collapse = " ")
z
[1] "a b c"
not "foo bar baz", which is what I want.
We can use mget to return the values of the objects in a list and then with do.call paste them into a single string
do.call(paste, c(mget(ls()), sep= " "))
As the sep is " ", we don't need that in paste as it by default giving a space
do.call(paste, mget(ls()))
I am balancing several versions of R and want to change my R libraries loaded depending on which R and which operating system I'm using. As such, I want to stick with base R functions.
I was reading this page to see what the base R equivalent to stringr::str_extract was:
http://stat545.com/block022_regular-expression.html
It suggested I could replicate this functionality with grep. However, I haven't been able to get grep to do more than return the whole string if there is a match. Is this possible with grep alone, or do I need to combine it with another function? In my case I'm trying to distinguish between CentOS versions 6 and 7.
grep(pattern = "release ([0-9]+)", x = readLines("/etc/system-release"), value = TRUE)
1) strcapture If you want to extract a string of digits and dots from "release 1.2.3" using base then
x <- "release 1.2.3"
strcapture("([0-9.]+)", x, data.frame(version = character(0)))
## version
## 1 1.2.3
2) regexec/regmatches There is also regmatches and regexec but that has already been covered in another answer.
3) sub Also it is often possible to use sub:
sub(".* ([0-9.]+).*", "\\1", x)
## [1] "1.2.3"
3a) If you know the match is at the beginning or end then delete everything after or before it:
sub(".* ", "", x)
## [1] "1.2.3"
4) gsub Sometimes we know that the field to be extracted has certain characters and they do not appear elsewhere. In that case simply delete every occurrence of every character that cannot be in the string:
gsub("[^0-9.]", "", x)
## [1] "1.2.3"
5) read.table One can often decompose the input into fields and then pick off the desired one by number or via grep. strsplit, read.table or scan can be used:
read.table(text = x, as.is = TRUE)[[2]]
## [1] "1.2.3"
5a) grep/scan
grep("^[0-9.]+$", scan(textConnection(x), what = "", quiet = TRUE), value = TRUE)
## [1] "1.2.3"
5b) grep/strsplit
grep("^[0-9.]+$", strsplit(x, " ")[[1]], value = TRUE)
## [1] "1.2.3"
6) substring If we know the character position of the field we can use substring like this:
substring(x, 9)
## [1] "1.2.3"
6a) substring/regexpr or we may be able to use regexpr to locate the character position for us:
substring(x, regexpr("\\d", x))
## [1] "1.2.3"
7) read.dcf Sometimes it is possible to convert the input to dcf form in which case it can be read with read.dcf. Such data is of the form name: value
read.dcf(textConnection(sub(" ", ": ", x)))
## release
## [1,] "1.2.3"
You could do
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"
stringr::str_extract(txt, pattern)
# [1] "release 123" NA "release 123"
sapply(regmatches(txt, regexec(pattern, txt)), "[", 1)
# [1] "release 123" NA "release 123"
txt <- c("foo release 123", "bar release", "foo release 123 bar release 123")
pattern <- "release ([0-9]+)"
Extract first match
sapply(
X = txt,
FUN = function(x){
tmp = regexpr(pattern, x)
m = attr(tmp, "match.length")
st = unlist(tmp)
if (st == -1){NA}else{substr(x, start = st, stop = st + m - 1)}
},
USE.NAMES = FALSE)
#[1] "release 123" NA "release 123"
Extract all matches
sapply(
X = txt,
FUN = function(x){
tmp = gregexpr(pattern, x)
m = attr(tmp[[1]], "match.length")
st = unlist(tmp)
if (st[1] == -1){
NA
}else{
sapply(seq_along(st), function(i) substr(x, st[i], st[i] + m[i] - 1))
}
},
USE.NAMES = FALSE)
#[[1]]
#[1] "release 123"
#[[2]]
#[1] NA
#[[3]]
#[1] "release 123" "release 123"
What is the cleanest way of finding for example the string ": [1-9]*" and only keeping that part?
You can work with regexec to get the starting points, but isn't there a cleaner way just to get immediately the value?
For example:
test <- c("surface area: 458", "bedrooms: 1", "whatever")
regexec(": [1-9]*", test)
How do I get immediately just
c(": 458",": 1", NA )
You can use base R which handles this just fine.
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> r <- regmatches(x, gregexpr(':.*', x))
> unlist({r[sapply(r, length)==0] <- NA; r})
# [1] ": 458" ": 1" NA
Although, I find it much simpler to just do...
> x <- c('surface area: 458', 'bedrooms: 1', 'whatever')
> sapply(strsplit(x, '\\b(?=:)', perl=T), '[', 2)
# [1] ": 458" ": 1" NA
library(stringr)
str_extract(test, ":.*")
#[1] ": 458" ": 1" NA
Or for a faster approach stringi
library(stringi)
stri_extract_first_regex(test, ":.*")
#[1] ": 458" ": 1" NA
If you need the keep the values of the one that doesn't have the match
gsub(".*(:.*)", "\\1", test)
#[1] ": 458" ": 1" "whatever"
Try any of these. The first two use the base of R only. The last one assumes that we want to return a numeric vector.
1) sub
s <- sub(".*:", ":", test)
ifelse(test == s, NA, s)
## [1] ": 458" ": 1" NA
If there can be more than one : in a string then replace the pattern with "^[^:]*:" .
2) strsplit
sapply(strsplit(test, ":"), function(x) c(paste0(":", x), NA)[2])
## [1] ": 458" ": 1" NA
Do not use this one if there can be more than one : in a string.
3) strapplyc
library(gsubfn)
s <- strapplyc(test, "(:.*)|$", simplify = TRUE)
ifelse(s == "", NA, s)
## [1] ": 458" ": 1" NA
We can omit the ifelse line if "" is ok instead of NA.
4) strapply If the idea is really that there are some digits on the line and we want to return the numbers or NA then try this:
library(gsubfn)
strapply(test, "\\d+|$", as.numeric, simplify = TRUE)
## [1] 458 1 NA