Is there a way to split camel case strings in R?
I have attempted:
string.to.split = "thisIsSomeCamelCase"
unlist(strsplit(string.to.split, split="[A-Z]") )
# [1] "this" "s" "ome" "amel" "ase"
string.to.split = "thisIsSomeCamelCase"
gsub("([A-Z])", " \\1", string.to.split)
# [1] "this Is Some Camel Case"
strsplit(gsub("([A-Z])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Looking at Ramnath's and mine I can say that my initial impression that this was an underspecified question has been supported.
And give Tommy and Ramnath upvotes for pointing out [:upper:]
strsplit(gsub("([[:upper:]])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this" "Is" "Some" "Camel" "Case"
Here is one way to do it
split_camelcase <- function(...){
  strings <- unlist(list(...))
  strings <- gsub("^[^[:alnum:]]+|[^[:alnum:]]+$", "", strings)
  strings <- gsub("(?!^)(?=[[:upper:]])", " ", strings, perl = TRUE)
  return(strsplit(tolower(strings), " ")[[1]])
}
split_camelcase("thisIsSomeGood")
# [1] "this" "is" "some" "good"
Here's an approach using a single regex (a lookbehind followed by a lookahead):
strsplit(string.to.split, "(?<=[a-z])(?=[A-Z])", perl = TRUE)
## [[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
Here is a one-liner using the gsubfn package's strapply. The regular expression matches the beginning of the string (^) followed by one or more lower case letters ([[:lower:]]+) or (|) an upper case letter ([[:upper:]]) followed by zero or more lower case letters ([[:lower:]]*) and processes the matched strings with c (which concatenates the individual matches into a vector). As with strsplit it returns a list so we take the first component ([[1]]) :
library(gsubfn)
strapply(string.to.split, "^[[:lower:]]+|[[:upper:]][[:lower:]]*", c)[[1]]
## [1] "this" "Is" "Some" "Camel" "Case"
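If you would rather not load a package, the same pattern should also work with base R's gregexpr and regmatches. This is only a sketch of an equivalent, not the code from the answer above; the names m and pieces are just for illustration, and perl = TRUE is my own choice.

```r
string.to.split <- "thisIsSomeCamelCase"

# gregexpr() locates every match of the pattern; regmatches() extracts them.
# Same regex as above: a leading lower-case run, or an upper-case letter
# followed by zero or more lower-case letters.
m <- gregexpr("^[[:lower:]]+|[[:upper:]][[:lower:]]*", string.to.split, perl = TRUE)
pieces <- regmatches(string.to.split, m)[[1]]
pieces
# [1] "this" "Is" "Some" "Camel" "Case"
```

Like strapply and strsplit, regmatches returns a list here, hence the [[1]].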
I think my other answer is better than the following, but if only a one-liner to split is needed, here we go:
library(snakecase)
unlist(strsplit(to_parsed_case(string.to.split), "_"))
#> [1] "this" "Is" "Some" "Camel" "Case"
The beginnings of an answer is to split all the characters:
sp.x <- strsplit(string.to.split, "")
Then find which string positions are upper case:
ind.x <- lapply(sp.x, function(x) which(!tolower(x) == x))
Then use that to split out each run of characters . . .
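One way the remaining step could go, as a sketch only (it handles a single string, and the names chars, is.upper, group and pieces are mine, not from the answer): number the runs with cumsum and paste each run back together.

```r
string.to.split <- "thisIsSomeCamelCase"

# Split into single characters (one string only, hence [[1]]).
chars <- strsplit(string.to.split, "")[[1]]

# TRUE where the character is upper case.
is.upper <- tolower(chars) != chars

# Each upper-case letter starts a new group: 0 0 0 0 1 1 2 2 2 2 ...
group <- cumsum(is.upper)

# Paste the characters of each group back together and drop the group names.
pieces <- unname(vapply(split(chars, group), paste, character(1), collapse = ""))
pieces
# [1] "this" "Is" "Some" "Camel" "Case"
```

Note that a run of consecutive capitals (e.g. "parseHTML") would be split into single letters with this approach.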
Here is an easy solution via snakecase + some tidyverse helpers:
install.packages("snakecase")
library(snakecase)
library(magrittr)
library(stringr)
library(purrr)
string.to.split = "thisIsSomeCamelCase"
to_parsed_case(string.to.split) %>%
str_split(pattern = "_") %>%
purrr::flatten_chr()
#> [1] "this" "Is" "Some" "Camel" "Case"
GitHub link to snakecase: https://github.com/Tazinho/snakecase
I have a dataset say
x <- c('test/test/my', 'et/tom/cat', 'set/eat/is', 'sk / handsome')
I'd like to remove everything before (including) the last slash, the result should look like
my cat is handsome
I googled this code which gives me everything before the last slash
gsub('(.*)/\\w+', '\\1', x)
[1] "test/test" "et/tom" "set/eat" "sk / handsome"
How can I change this code, so that the other part of the string after the last slash can be shown?
Thanks
You can use basename:
paste(trimws(basename(x)),collapse=" ")
# [1] "my cat is handsome"
Using strsplit
> sapply(strsplit(x, "/\\s*"), tail, 1)
[1] "my" "cat" "is" "handsome"
Another way for gsub
> gsub("(.*/\\s*(.*$))", "\\2", x) # without 'unwanted' spaces
[1] "my" "cat" "is" "handsome"
Using str_extract from stringr package
> library(stringr)
> str_extract(x, "\\w+$") # without 'unwanted' spaces
[1] "my" "cat" "is" "handsome"
You can basically just move where the parentheses are in the regex you already found:
gsub('.*/ ?(\\w+)', '\\1', x)
You could use
x <- c('test/test/my', 'et/tom/cat', 'set/eat/is', 'sk / handsome')
gsub('^(?:[^/]*/)*\\s*(.*)', '\\1', x)
Which yields
[1] "my" "cat" "is" "handsome"
To have it in one sentence, you could paste it:
(paste0(gsub('^(?:[^/]*/)*\\s*(.*)', '\\1', x), collapse = " "))
The pattern here is:
^ # start of the string
(?:[^/]*/)* # a run of non-slash characters followed by a slash, zero or more times
\\s* # optional whitespace
(.*) # capture the rest of the string
The whole match is replaced by \\1, i.e. the content of the first capture group.
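Because .* is greedy, a shorter pattern should give the same result for this data: match everything up to the last slash plus any whitespace after it, and replace that prefix with nothing. A minimal sketch (after is just an illustrative name):

```r
x <- c('test/test/my', 'et/tom/cat', 'set/eat/is', 'sk / handsome')

# ".*/" is greedy, so it runs to the LAST slash; "\\s*" then consumes the
# optional whitespace after it. Replacing that prefix with "" leaves the tail.
after <- sub("^.*/\\s*", "", x)
after
# [1] "my" "cat" "is" "handsome"

paste(after, collapse = " ")
# [1] "my cat is handsome"
```

Strings without any slash are returned unchanged, since the pattern does not match them.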
I need to extract a string that spans across multiple lines on an object.
The object:
> text <- paste("abc \nd \ne")
> cat(text)
abc
d
e
With str_extract_all I can extract all the text between ‘a’ and ‘c’, for example.
> str_extract_all(text, "a.*c")
[[1]]
[1] "abc"
Using the function ‘regex’ and the argument ‘multiline’ set to TRUE, I can extract a string across multiple lines. In this case, I can extract the first character of multiple lines.
> str_extract_all(text, regex("^."))
[[1]]
[1] "a"
> str_extract_all(text, regex("^.", multiline = TRUE))
[[1]]
[1] "a" "d" "e"
But when I try to extract "every character between a and d" (a regex that spans across multiple lines), the output is "character(0)".
> str_extract_all(text, regex("a.*d", multiline = TRUE))
[[1]]
character(0)
The desired output is:
“abcd”
How to get it with stringr?
dplyr:
library(dplyr)
library(stringr)
data.frame(text) %>%
mutate(new = lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
text new
1 abc \nd \ne abcd
Here we use the character class \\w, which does not include the new line metacharacter \n. The negative lookahead (?!e) makes sure the e is not matched.
Without dplyr (this still uses stringr's str_extract_all):
unlist(lapply(str_extract_all(text, "(?!e)\\w"), paste0, collapse = ""))
[1] "abcd"
str_remove_all(text,"\\s\\ne?")
[1] "abcd"
OR
paste0(trimws(strsplit(text, "\\ne?")[[1]]), collapse="")
[1] "abcd"
The answers above remove line breaks. So, a two-step approach can work to get the desired output 'abcd'.
1 - Use str_remove_all or gsub to remove the line breaks (in this case, also removing blank spaces).
2 - Use str_extract_all to get the desired output ('abcd' in this case).
> text %>%
+ str_remove_all("\\s\\n") %>%
+ str_extract_all("a.*d")
[[1]]
[1] "abcd"
Short regex reference:
\n - newline (line feed)
\s - any whitespace
\r - carriage return
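As far as I understand it, "a.*d" returns character(0) in the question because . does not match a line break by default, and multiline only changes what ^ and $ mean. stringr's regex() has a dotall argument for this. The sketch below shows the same idea in base R with the PCRE (?s) flag, so that it runs without any packages; the name span is only illustrative.

```r
text <- "abc \nd \ne"

# (?s) switches on "dotall" in PCRE, so "." also matches "\n".
span <- regmatches(text, regexpr("(?s)a.*d", text, perl = TRUE))
span
# [1] "abc \nd"

# The match still contains the space and the line break; strip all
# whitespace to get the desired output.
result <- gsub("\\s", "", span)
result
# [1] "abcd"
```

So the match itself can span lines, but the whitespace has to be removed in a second step either way.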
Update:
In base R to get the desired output abcd:
text <- gsub("[\r\n]|[[:blank:]]", "", text)
substr(text,1, nchar(text)-1)
[1] "abcd"
First answer:
We can use gsub:
gsub("[\r\n]|[[:blank:]]", "", text)
[1] "abcde"
I have column names similar to the following
names(df_woe)
# [1] "A_FLAG" "woe.ABCD.binned" "woe.EFGHIJ.binned"
...
I would like to rename the columns by removing the "woe." and ".binned" sections, so that the following will be returned
names(df_woe)
# [1] "A_FLAG" "ABCD" "EFGHIJ"
...
I have tried substr(names(df_woe), start, stop) but I am unsure how to set variable start/stop arguments.
Another possible and readable regex can be to create groups and return the group after the first and before the second dot, i.e.
gsub("(.*\\.)(.*)\\..+", "\\2", names(df_woe))
#[1] "A_FLAG" "ABCD" "EFGHIJ"
nam <- c("A_FLAG", "woe.ABCD.binned", "woe.EFGH.binned")
gsub("woe\\.|\\.binned", "", nam)
[1] "A_FLAG" "ABCD" "EFGH"
EDIT: a solution that deals with weirder cases such as woe..binned.binned
gsub("^woe\\.|\\.binned$", "", nam)
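To see what the anchors change, here is a small comparison on that weird case. It is just a sketch; the vector nam2 and the result names are made up for illustration.

```r
nam2 <- c("A_FLAG", "woe.ABCD.binned", "woe..binned.binned")

# Unanchored: every "woe." and ".binned" is removed, wherever it occurs.
unanchored <- gsub("woe\\.|\\.binned", "", nam2)
unanchored
# [1] "A_FLAG" "ABCD" ""

# Anchored: only a leading "woe." and a trailing ".binned" are removed.
anchored <- gsub("^woe\\.|\\.binned$", "", nam2)
anchored
# [1] "A_FLAG" "ABCD" ".binned"
```

For ordinary names both give the same result; they only differ when the prefix or suffix text also appears in the middle of a name.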
Another solution, using stringr package:
library(stringr)
str_replace_all("woe.ABCD.binned", pattern = "woe\\.|\\.binned", replacement = "")
# [1] "ABCD"
A beginner question...
I have a list like this:
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
(but many more lines)
I need to extract what is between "bb=" and ",". I.e. I want:
x21
g-25
26
Having read many similar questions here, I suppose it is stringr with str_extract I should use, but somehow I can't get it to work. Thanks for all help.
/Chris
strapply in the gsubfn package can do that. Note that [^,]* matches a string of non-commas.
strapply extracts the back referenced portion (the part within parentheses):
> library(gsubfn)
> strapply(x, "bb=([^,]*)", simplify = TRUE)
[1] "x21" "g-25" "26"
If there are several x vectors then provide them in a list like this:
> strapply(list(x, x), "bb=([^,]*)")
[[1]]
[1] "x21" "g-25" "26"
[[2]]
[1] "x21" "g-25" "26"
An option using regexpr:
> temp = regexpr('bb=[^,]*', x)
> substr(x, temp + 3, temp + attr(temp, 'match.length') - 1)
[1] "x21" "g-25" "26"
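A variation on the same idea, as a sketch: with perl = TRUE a lookbehind keeps "bb=" out of the match, so regmatches can pull the values out directly and no offset arithmetic is needed. The names m and vals are only illustrative.

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")

# (?<=bb=) is a lookbehind: the match must be preceded by "bb=" but does
# not include it. [^,]* then takes everything up to the next comma.
m <- regexpr("(?<=bb=)[^,]*", x, perl = TRUE)
vals <- regmatches(x, m)
vals
# [1] "x21" "g-25" "26"
```

Be aware that regmatches silently drops elements with no match, so the result can be shorter than x if some lines have no "bb=".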
Here's one solution using the base regex functions in R. First we use strsplit to split on the comma. Then we use grepl to filter only the items that start with bb= and gsub to extract all the characters after bb=.
> x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
> y <- unlist(strsplit(x , ","))
> unlist(lapply(y[grepl("bb=", y)], function(x) gsub("^.*bb=(.*)", "\\1", x)))
[1] "x21" "g-25" "26"
It looks like str_replace is the function you are after if you want to go that route:
> str_replace(y[grepl("bb=",y)], "^.*bb=(.*)", "\\1")
[1] "x21" "g-25" "26"
Read it in with commas as separators and take the second column:
> x.split <- read.table(textConnection(x), header=FALSE, sep=",", stringsAsFactors=FALSE)[[2]]
> x.split
[1] " bb=x21" " bb=g-25" " bb=26"
Then remove the "bb="
> gsub("bb=", "", x.split )
[1] " x21" " g-25" " 26"