Move "*" to new column in R

Hello, I have a column in a data.frame with many rows, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"))
I want to make a new column "Species_new" where the "*" is moved to the end of the character string, e.g.,
df = data.frame("Species" = c("*Briza minor", "*Briza minor", "Wattle"),
"Species_new" = c("Briza minor*", "Briza minor*", "Wattle"))
Is there a way to do this using gsub? The manual example would take far too long as I have approximately 50,000 rows.
Thanks in advance

One option is to capture the * as a group and in the replacement reverse the backreferences
df$Species_new <- sub("^([*])(.*)$", "\\2\\1", df$Species)
df$Species_new
#[1] "Briza minor*" "Briza minor*" "Wattle"
NOTE: * is a regex metacharacter meaning "zero or more of the preceding element", so we either escape it (\\*) or place it inside a bracket expression ([*]) so that it is matched as a literal character.
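A minimal sketch of the two ways to match the star literally, using the example values from the question. Both calls are base R; the variable names are chosen only for illustration.

```r
species <- c("*Briza minor", "*Briza minor", "Wattle")

# Bracket expression: [*] matches a literal asterisk
with_brackets <- sub("^([*])(.*)$", "\\2\\1", species)

# Escaped form: "\\*" in an R string is the regex \*, also a literal asterisk
with_escape <- sub("^(\\*)(.*)$", "\\2\\1", species)

# Both forms should give the same result; "Wattle" has no leading star
# and is therefore returned unchanged
identical(with_brackets, with_escape)
```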

Thanks so much for the quick response. I also found a workaround:
df$Species_new = sub("[*]","",df$Species, perl=TRUE)
differences = setdiff(df$Species,df$Species_new)
# %in% rather than ==, so that more than one distinct starred species is handled
tochange = subset(df, Species %in% differences)
toleave = subset(df, !(Species %in% differences))
tochange$Species_new = paste(tochange$Species_new, "*", sep = "")
df = rbind(tochange,toleave)
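The workaround splits the data frame and binds it back together, which changes the row order. A possible alternative that keeps the rows in place, sketched here with base R only (has_star is an illustrative name, and stringsAsFactors = FALSE is set so the column is character on older R versions):

```r
df <- data.frame(Species = c("*Briza minor", "Wattle", "*Briza minor"),
                 stringsAsFactors = FALSE)

# TRUE for rows whose name begins with a literal "*"
has_star <- startsWith(df$Species, "*")

# Drop the first character and append "*" only where a leading star was found
df$Species_new <- ifelse(has_star,
                         paste0(substring(df$Species, 2), "*"),
                         df$Species)
df
```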

Related

separate distinct strings with common characters using stringr::str_detect() in R

I have the following sample character vector:
sample_dat <- c("Q2", "Q20", "Q21", "Q23_8_T", "Q21_fct", "Q2_fct7", "Q20_fct7_4", "Q2_fct7_4")
From this vector of strings, I want to isolate those that share in common the initial prefix using a regular expression so that I might be able to use it again in a function, such that the desired subset of strings for prefix = "Q2" would be the result of the following code snippet:
(desired_subset <- sample_dat[c(1, 6, 8)])
That is, the desired output should be c("Q2", "Q2_fct7", "Q2_fct7_4")
I tried using stringr::str_detect() to reproduce the desired_subset using a regular expression, but I am unable to get desired_subset[1] into the result:
library(stringr)
sample_dat[str_detect(string = sample_dat, pattern = "Q2_")]
In the case above, too few results are returned: I am missing "Q2" itself.
Whereas in the code below, too many results are returned. For example "Q20" and "Q21" are returned which is not what I want.
sample_dat[str_detect(string = sample_dat, pattern = "Q2")]
eventually, I'd like to use it in a function like so:
subset_str <- function(str, prefix){
substitute(prefix)
str_set <- str_detect(string = str, pattern = paste0(eval(prefix),'_'))
return(str[str_set])
}
such that
subset_str(sample_dat, "Q2") would return ONLY
c("Q2", "Q2_fct7", "Q2_fct7_4") and
subset_str(sample_dat, "Q20") would return ONLY
c("Q20", "Q20_fct7_4")
Perhaps there is someone who might be able to help me.
Thanks.
We can specify the pattern as the intended substring to match ("Q20") that is the start of the string (^) followed by either a _ or (|) it is the end ($) of the string
grep("^Q20(_|$)", sample_dat, value = TRUE)
#[1] "Q20" "Q20_fct7_4"
grep("^Q2(_|$)", sample_dat, value = TRUE)
#[1] "Q2" "Q2_fct7" "Q2_fct7_4"
which can be wrapped into a function
subset_str <- function(string, pattern){
grep(pattern, string, value = TRUE)
}
Or the same pattern in str_detect
library(stringr)
sample_dat[str_detect(string = sample_dat, pattern = "^Q2(_|$)")]
#[1] "Q2" "Q2_fct7" "Q2_fct7_4"
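The question asked for a function that takes only the prefix, so one way to get there is to build the pattern inside the function. This is a sketch in base R; subset_by_prefix is a hypothetical name, and it assumes the prefix contains no regex metacharacters (they are not escaped here).

```r
sample_dat <- c("Q2", "Q20", "Q21", "Q23_8_T", "Q21_fct",
                "Q2_fct7", "Q20_fct7_4", "Q2_fct7_4")

# Hypothetical helper: keep strings that are exactly `prefix`
# or start with `prefix` followed by an underscore
subset_by_prefix <- function(strings, prefix) {
  pattern <- paste0("^", prefix, "(_|$)")
  grep(pattern, strings, value = TRUE)
}

q2_set  <- subset_by_prefix(sample_dat, "Q2")
q20_set <- subset_by_prefix(sample_dat, "Q20")
q2_set
q20_set
```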

Filter according to partial match of string variable in R

I have a data frame with a string column "disease". I want to filter the rows with a partial match on "trauma" or "Trauma". So far I have done the following using dplyr and stringr:
trauma_set <- df %>% filter(str_detect(disease, "trauma|Trauma"))
But the result also includes "Nontraumatic" and "nontraumatic". How can I filter only "trauma, Trauma, traumatic or Traumatic" without including nontrauma or Nontrauma? Also, is there a way I can define the string to detect without having to specify both uppercase and lowercase version of the string (as in both trauma and Trauma)?
To require a word boundary, put \\b at the start of the pattern. To match regardless of case, wrap the pattern in regex() and set ignore_case = TRUE
library(dplyr)
library(stringr)
out <- df %>%
filter(str_detect(disease, regex("\\btrauma", ignore_case = TRUE)))
sum(str_detect(out$disease, regex("^Non", ignore_case = TRUE)))
#[1] 0
data
set.seed(24)
df <- data.frame(disease = sample(c("Nontraumatic", "Trauma",
"Traumatic", "nontraumatic", "traumatic", "trauma"), 50 ,
replace = TRUE), value = rnorm (50))
You were very close to a correct solution; you just needed to add the "start of string" anchor ^, as follows:
trauma_set <- df %>% filter(str_detect(disease, "^trauma|^Trauma"))
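The two answers behave differently when the word does not sit at the very start of the value. A small base R sketch of the difference, using made-up example values ("head trauma" is not in the original data) and grepl() in place of str_detect():

```r
disease <- c("Nontraumatic", "Trauma", "head trauma", "traumatic")

# Word boundary: "trauma" must begin a word, anywhere in the string
by_boundary <- grepl("\\btrauma", disease, ignore.case = TRUE, perl = TRUE)

# Start-of-string anchor: the string itself must begin with "trauma"
by_anchor <- grepl("^trauma", disease, ignore.case = TRUE, perl = TRUE)

data.frame(disease, by_boundary, by_anchor)
```

If the column only ever holds single words, as in the sample data above, both patterns select the same rows.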

conditional str_replace based on matching regex within mutate?

For any entries of the column "district" that match regex("[:alpha:]{2}AL"), I would like to replace the "AL" with "01".
For example:
df <- tibble(district = c("NY14", "MT01", "MTAL", "PA10", "KS02", "NDAL", "ND01", "AL02", "AL01"))
I tried:
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
str_replace(district,"AL","01")))
and
df %>% mutate(district=replace(district,
str_detect(district, regex("[:alpha:]{2}AL")),
paste(str_sub(district, start = 1, end = 2),"01",sep = "")))
but there is a vectorization problem.
Is this ok?
str_replace_all(string=df$district,
pattern="(\\w{2})AL",
replacement="\\101")
I replaced [:alpha:] with \\w, which matches any word character (letters, digits and underscore): https://www.regular-expressions.info/shorthand.html
In the replacement, \\1 refers to the first captured group, (\\w{2}), so the first two characters are kept and 01 is written in place of AL.
You can swap replace() for ifelse(), which picks element by element between the two full-length vectors:
ifelse( str_detect(df$district, regex("[:alpha:]{2}AL")),
str_replace(df$district,"AL","01"),df$district)
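The "vectorization problem" in the question most likely comes from replace(x, list, values): it assigns values into x[list], so values should have one element per TRUE, not one per row. A base R sketch of the same conditional replacement follows; it uses a plain character vector instead of the tibble, and anchors the pattern with ^ and $, which is an added assumption that district codes are exactly four characters.

```r
district <- c("NY14", "MT01", "MTAL", "PA10", "KS02",
              "NDAL", "ND01", "AL02", "AL01")

# TRUE only for two letters followed by "AL" (so "AL02" is not touched)
is_at_large <- grepl("^[[:alpha:]]{2}AL$", district)

# Keep the two-letter state code and append "01" where the test is TRUE
district_fixed <- ifelse(is_at_large,
                         paste0(substr(district, 1, 2), "01"),
                         district)
district_fixed
```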

Replace multiple strings comprising of a different number of characters with one gsubfn()

In "Replace multiple strings in one gsub() or chartr() statement in R?" it is explained how to replace multiple single-character strings in one statement with gsubfn(). E.g.:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", " " = ""), x)
# "doremig_k"
I would however like to replace the string 'doremi' in the example with ''. This does not work:
x <- "doremi g-k"
gsubfn(".", list("-" = "_", "doremi" = ""), x)
# "doremi g_k"
I guess this is because the string 'doremi' contains multiple characters while I am using the metacharacter . (which matches a single character) in gsubfn. I have no idea what to replace it with. I must confess I sometimes find metacharacters a bit difficult to understand. So, is there a way for me to replace '-' and 'doremi' at once?
You might be able to just use base R sub here:
x <- "doremi g-k"
result <- sub("doremi\\s+([^-]+)-([^-]+)", "\\1_\\2", x)
result
[1] "g_k"
Does this work for you?
gsubfn::gsubfn(pattern = "doremi|-", list("-" = "_", "doremi" = ""), x)
[1] " g_k"
The key is this search: "doremi|-" which tells to search for either "doremi" or "-". Use "|" as the or operator.
Just a more generic version of @RLave's solution:
toreplace <- list("-" = "_", "doremi" = "")
gsubfn(paste(names(toreplace),collapse="|"), toreplace, x)
[1] " g_k"
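If installing gsubfn is not an option, a similar result can be sketched in base R by applying the replacements one after another. Note the difference in technique: this is a sequential loop over gsub(), not a single pass, so a later replacement can act on the output of an earlier one. fixed = TRUE treats each name as a literal string rather than a regex.

```r
x <- "doremi g-k"
toreplace <- c("-" = "_", "doremi" = "")

result <- x
for (from in names(toreplace)) {
  # Replace every literal occurrence of `from` with its mapped value
  result <- gsub(from, toreplace[[from]], result, fixed = TRUE)
}
result
```

As with the gsubfn() calls above, the space after "doremi" is kept.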

Avoid that space in column name is replaced with period (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem that I am faced with: I input the data using read.csv(filename,header=TRUE), and then the space in variable names became ".", for example, a variable named Full Code became Full.Code in the generated dataframe. After the processing, I use write.xlsx(filename) to export the results, while the variable names are changed. How to address this problem?
Besides, in the output .xlsx file, the first column becomes indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names=FALSE in read.csv when you read the data in, then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (with backquotes in some cases) or refer to the columns by location rather than by name while editing.
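A short sketch of the effect of check.names, reading from an inline string instead of a file so that it is self-contained (the column names and values are made up for illustration):

```r
csv_text <- "Full Code,Value\nA1,10\nB2,20"

# Default behaviour: names are made syntactically valid
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE: names are kept exactly as in the header
raw <- read.csv(text = csv_text, check.names = FALSE)
kept_names <- names(raw)

default_names
kept_names

# Columns with spaces need backquotes or [[ ]] to be accessed by name
raw$`Full Code`
raw[["Full Code"]]
```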
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
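One possible refactoring of the function above, as a sketch: collapse runs of dots to a single space, then strip trailing whitespace with trimws(). make_names_friendly is a hypothetical name and the example data frame is made up.

```r
# Hypothetical condensed variant of makeColNamesUserFriendly()
make_names_friendly <- function(ds) {
  # Any run of consecutive dots becomes a single space
  cleaned <- gsub("\\.+", " ", names(ds))
  # Remove the trailing space left behind by trailing dots
  names(ds) <- trimws(cleaned, which = "right")
  ds
}

ds <- data.frame(1:2, 3:4, 5:6)
names(ds) <- c("Full.Code", "a..b.", "x")

friendly_names <- names(make_names_friendly(ds))
friendly_names
```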
Just to add to the answers already provided, here is another way of replacing the "." or any other kind of punctuation in column names, using a regex with the stringr package:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"

Resources