In R, I have a single string with multiple entries such as:
mydata <- c("(first data entry) (second data entry) (third data entry) ")
I want to insert the pipe symbol "|" between the entries as an item delimiter, ending up with the following string:
"(first data entry)|(second data entry)|(third data entry)"
Not all of the mydata rows contain the same number of entries. If mydata contains 0 or just 1 entry, then no "|" pipe symbol is required.
I've tried the following without success:
newdata <- paste(mydata, collapse = "|")
Thanks for your help!
You do not need a regex if you have a consistent ") (" pattern (closing parenthesis + one space + opening parenthesis).
You can simply use
gsub(") (", ")|(", mydata, fixed=TRUE)
If your strings contain a variable amount of whitespace (spaces, tabs, etc.), you can use
gsub("\\)\\s*\\(", ")|(", mydata)
gsub("\\)[[:space:]]*\\(", ")|(", mydata)
stringr::str_replace_all(mydata, "\\)\\s*\\(", ")|(")
Here, the \)\s*\( pattern matches a ) (escaped because parentheses are special regex metacharacters), then zero or more whitespace characters, and then a (.
If there is always one or more whitespaces between parentheses, use \s+ instead of \s*.
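A quick illustration with irregular whitespace (the messy string below is made up for demonstration):
messy <- "(first data entry)   (second data entry)    (third data entry)"
gsub("\\)\\s*\\(", ")|(", messy)
#> [1] "(first data entry)|(second data entry)|(third data entry)"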
This will replace each ") (" separator with ")|(" in your string (fixed = TRUE makes the parentheses literal). If you need more complex rules, use a regex with gsub.
gsub(") (", ")|(", yourString, fixed = TRUE)
I think you could use the following solution too:
gsub("(?<=\\))\\s+(?=\\()", "|", mydata, perl = TRUE)
[1] "(first data entry)|(second data entry)|(third data entry) "
A field in my dataset has some observations that start with a ".", e.g. ".TN34AB1336" instead of "TN34AB1336".
I've tried
truck_log <- truck_log %>%
  filter(BookingID != "WDSBKTP49392") %>%
  filter(BookingID != "WDSBKTP44502") %>%
  mutate(GpsProvider = str_replace(GpsProvider, "NULL", "UnknownGPS")) %>%
  mutate(vehicle_no = str_replace(vehicle_no, ".TN34AB1336", "TN34AB1336"))
The last mutate command in my code worked, but there are more issues like this in another field, e.g. "###TN34AB1336" instead of "TN34AB1336".
So I need a way of automating the process such that every observation that doesn't start with a letter is trimmed from the left by a single command in R.
I've attached a picture of a filtered field from the spreadsheet to make the question clearer.
We can use a regular expression to replace anything up to the first alphanumeric character with "" to remove everything that is not a number/letter from the beginning of a string:
names <- c("###TN34AB1336",
           ".TN34AB1336",
           ",TN654835A34",
           ":+?%TN735345")
stringr::str_replace(names, "^.*?(?=[[:alnum:]])", "") # Matches up to, but not including, the first alphanumeric character and replaces it with "" (the lazy .*? also leaves already-clean strings untouched)
[1] "TN34AB1336" "TN34AB1336" "TN654835A34" "TN735345"
You can use sub.
s <- c(".TN34AB1336", "###TN34AB1336")
sub("^[^A-Z]*", "", s)
#[1] "TN34AB1336" "TN34AB1336"
Where ^ is the start of the string, [^A-Z] matches everything that is not A, B, C, ..., Z, and * matches it 0 to n times.
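If the IDs could also start with a lowercase letter or a digit (an assumption, not stated in the question), a slightly wider character class works the same way:
sub("^[^A-Za-z0-9]*", "", c(".TN34AB1336", "###tn34ab1336"))
#[1] "TN34AB1336" "tn34ab1336"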
I have 4 values in the vector x:
x <- c("JSMITH_WWWFRecvd2001_asof_20220901.xlsx", "WSMITH_AMEXRecvd2002_asof_20220901.xlsx",
       "PSMITH_WWWFRecvd2003_asof_20220901.xlsx", "QSMITH_AMEXRecvd2004_asof_20220901.xlsx")
I would like my outcome to be
"wwwf_01","amex_02","wwwf_03","amex_04"
You can use sub:
tolower(sub('.+_(.+)Recvd[0-9][0-9](..).+', '\\1_\\2', x))
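With the vector x from the question, this returns:
#> [1] "wwwf_01" "amex_02" "wwwf_03" "amex_04"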
Something like this would work. You can extract the string you want with str_extract(), make it lower case with tolower(), and paste the formatted counter to the end of the string with a "_" separator.
paste(tolower(stringr::str_extract(x,"WWWF|AMEX" )), sprintf("%02d",seq_along(x)), sep = "_")
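For the same vector x, this produces:
#> [1] "wwwf_01" "amex_02" "wwwf_03" "amex_04"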
I am a beginner programmer in R.
I have "cCt/cGt" and I want to extract C and G and write it like C>G.
test ="cCt/cGt"
str_extract(test, "[A-Z]+$")
Try this:
gsub(".*([A-Z]).*([A-Z]).*", "\\1>\\2", test )
[1] "C>G"
Here, we capture the two occurrences of the upper case letters in capturing groups given in parentheses (...). This enables us to refer to them (and only to them but not the rest of the string!) in gsub's replacement clause using backreferences \\1 and \\2. In the replacement clause we also include the desired >.
You seem to be looking for a mutation in two concatenated strings; this function should solve your problem:
extract_mutation <- function(text){
  splitted <- strsplit(text, split = "/")[[1]]
  pos <- regexpr("[[:upper:]]", splitted)
  uppercases <- regmatches(splitted, pos)
  mutation <- paste0(uppercases, collapse = ">")
  return(mutation)
}
If the two base exchanges are always at the same index, you could also return the position if you're interested:
position <- pos[1]
return(list(mutation, position))
instead of the return(mutation)
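A quick check with the string from the question, assuming the function as written above (returning only the mutation):
extract_mutation("cCt/cGt")
[1] "C>G"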
You might also capture the 2 uppercase chars, each preceded and followed by optional lowercase characters, with a / in between.
test ="cCt/cGt"
res = stringr::str_match(test, "([A-Z])[a-z]*/[a-z]*([A-Z])")
sprintf("%s>%s", res[2], res[3])
Output
[1] "C>G"
An exact match for the whole string could be:
^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$
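Plugged into the same str_match call, it gives the same captures (a quick check using the test string defined above):
res2 = stringr::str_match(test, "^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$")
sprintf("%s>%s", res2[2], res2[3])
[1] "C>G"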
I need to export information from a string into different columns.
More specifically, the content of the brackets within the string.
Let's say I have a string:
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
What I am trying to output is a vector with the contents of the square brackets; if there is a semicolon, save the parts as separate bracketed strings, and drop the round parentheses together with their contents.
e.g.
tmp <- function(a)
Result:
tmp(a)
"[K89]", "[K96]", "[N-Term]", "[S87]", "[S93]"
My approach so far:
pattern <- "(\\[.*?\\])"
hits <- gregexpr(pattern, a)
matches <- regmatches(a, hits)
unlisted_matches <- unlist(matches)
Results
"[K89; K96]" "[N-Term]" "[S87(100); S93(100)]"
This does give me the brackets but still doesn't split the terms. And for some reason I am not able to efficiently separate the ";" terms.
Here's a way using the tidyverse:
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
library(tidyverse)
a %>%
  # extract between square brackets, not keeping the brackets, and unlist
  str_extract_all("(?<=\\[).*?(?=\\])") %>%
  unlist() %>%
  # remove round brackets and their content
  str_replace_all("\\(.*?\\)", "") %>%
  # split by "; " and unlist
  str_split("; ") %>%
  unlist() %>%
  # put the brackets back
  str_c("[", ., "]")
#> [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
You may use
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
pattern <- "(?:\\G(?!^)(?:\\([^()]*\\))?\\s*;\\s*|\\[)\\K[^][;()]+"
matches <- regmatches(a, gregexpr(pattern, a, perl=TRUE))
unlisted_matches <- paste0("[", unlist(matches),"]")
unlisted_matches
## => [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
Pattern details
(?:\G(?!^)(?:\([^()]*\))?\s*;\s*|\[) - either the end of the previous successful match (\G(?!^)), optionally followed by a substring inside round parentheses ((?:\([^()]*\))?) and then a ; enclosed in optional whitespace (\s*;\s*), or a [ char
\K - match reset operator discarding all text matched so far
[^][;()]+ - one or more chars other than [, ], ;, ( and ).
The paste0("[", unlist(matches),"]") part wraps the matches with square brackets.
I am using R to do some data pre-processing, and here is the problem that I am faced with: I input the data using read.csv(filename, header=TRUE), and then the spaces in variable names became ".", for example, a variable named Full Code became Full.Code in the generated data frame. After the processing, I use write.xlsx(filename) to export the results, but the variable names have changed. How can I address this problem?
Besides, in the output .xlsx file, the first column becomes row indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names=FALSE in read.csv when you read the data in, then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (back quotes in some cases) or refer to the columns by location rather than name while editing.
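A minimal sketch of that approach (the file name is a placeholder; "Full Code" is the column mentioned in the question):
mydf <- read.csv("myfile.csv", header = TRUE, check.names = FALSE)
names(mydf)          # column names keep their spaces, e.g. "Full Code"
mydf[["Full Code"]]  # refer to such columns with quotes ...
mydf$`Full Code`     # ... or with back quotes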
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
                        pattern = "\\.",
                        replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
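For example, with a placeholder output file name (this assumes write.xlsx() comes from the xlsx package; openxlsx's write.xlsx() takes rowNames = FALSE instead):
write.xlsx(yourdata, "output.xlsx", row.names = FALSE)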
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
  # FIXME: Repetitive.
  # Convert any number of consecutive dots to a single space.
  names(ds) <- gsub(x = names(ds),
                    pattern = "(\\.)+",
                    replacement = " ")
  # Drop the trailing spaces.
  names(ds) <- gsub(x = names(ds),
                    pattern = "( )+$",
                    replacement = "")
  ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
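A quick check with made-up column names that contain consecutive and trailing dots:
ds <- data.frame(Full..Code. = 1:2, Other.Col = 3:4)
names(makeColNamesUserFriendly(ds))
#> [1] "Full Code" "Other Col"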
Just to add to the answers already provided, here is another way of replacing the "." or any other kind of punctuation in column names, using a regex with the stringr package like this:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"