I'm working on a data frame that contains non-detects (with inconsistent decimal separators), missing values, and measured values.
I want to replace the non-detects with half of the value after the less-than sign (<1 becomes 1/2 = 0.5).
Here is a sample of the imported data frame:
df = data.frame(value=c("NA", "1.2", "<1.0", "<6,6"))
1) Convert factor columns to character:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
2) Replace all "," with ".":
pattern = ","
grep(pattern, df, value = TRUE)
df <- data.frame(lapply(df, function(x) {gsub(pattern=pattern, replacement=".", x, perl = TRUE)}))
3) Find all non-detects and replace each with the value after the less-than sign:
pattern = "(^<)(\\d+)"
grep(pattern, df, value = TRUE)
df <- data.frame(lapply(df, function(x) {gsub(pattern=pattern, replacement="\\2", x, perl = TRUE)}))
I can't find how to perform a math operation on the matched string, something like:
replacement = as.character((as.numeric("\\2"))/2)
You can use the following code in Step 2:
df$value = gsub(",", ".", df$value, fixed = TRUE)
It will replace literal commas with literal dots in the value column.
Then, you can use the gsubfn package to manipulate substrings matched with a regex:
> library(gsubfn)
> df$value = gsubfn("^<(\\d*\\.?\\d+)", ~ as.numeric(x)/2, df$value)
> df
value
1 NA
2 1.2
3 0.5
4 3.3
Here, ^<(\\d*\\.?\\d+) matches < at the start of the string, and the \\d*\\.?\\d+ pattern matches and captures into Group 1 any integer or float value, which the callback function then divides by 2.
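Putting the steps together, here is a minimal end-to-end sketch (assuming the sample data frame from the question; the final as.numeric() coercion is an extra step beyond the answer that turns the "NA" strings into real NA values, with a harmless warning):
library(gsubfn)
df <- data.frame(value = c("NA", "1.2", "<1.0", "<6,6"), stringsAsFactors = FALSE)
df$value <- gsub(",", ".", df$value, fixed = TRUE)                   # step 2: commas to dots
df$value <- gsubfn("^<(\\d*\\.?\\d+)", ~ as.numeric(x)/2, df$value)  # halve the non-detects
df$value <- as.numeric(df$value)                                     # "NA" becomes NA
df
#   value
# 1    NA
# 2   1.2
# 3   0.5
# 4   3.3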
Related
I have a data which is like this:
abc <- data.frame( a = c("[100-150)", "[150, 200)"))
I want to alter it to make it like this:
abc <- data.frame(a = c("100-149", "150-199"))
I know how to replace the brackets:
abc$a <- lapply(abc$a, gsub, pattern = "[", replacement = "", fixed = TRUE)
abc$a <- lapply(abc$a, gsub, pattern = "]", replacement = "", fixed = TRUE)
abc$a <- lapply(abc$a, gsub, pattern = ")", replacement = "", fixed = TRUE)
It is subtracting 1 from the number at the end that is the problem.
Is there a way to do this?
Please note this is just an example; in reality my data has a column like this with about 2000 rows.
An option with gsubfn: we extract the number (\\d+) after the - or the comma, convert it to numeric, subtract 1, and paste it back together with -.
library(gsubfn)
gsubfn("[-,] ?(\\d+)", ~ paste0("-", as.numeric(x) - 1), as.character(abc$a))
#[1] "[100-149)" "[150-199)"
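To get all the way to the desired 100-149 / 150-199 form, the bracket removal the question already covers can be combined with the gsubfn step; a small sketch under that assumption:
library(gsubfn)
abc <- data.frame(a = c("[100-150)", "[150, 200)"))
out <- gsubfn("[-,] ?(\\d+)", ~ paste0("-", as.numeric(x) - 1), as.character(abc$a))
gsub("[][)]", "", out)   # strip the remaining brackets in one pass
# [1] "100-149" "150-199"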
I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiar with R's colnames() function)
Now, what I want is simply to flip the values in front of and after the underscore, e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM.
Some other base R options
1)
Use sub to switch the beginning and end; we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach, using chartr, dirname, and paste0, that might be more efficient. But we need to get the indices of the names that contain "NORM" first (a step-by-step trace follows the data block below).
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
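To see why the chartr/dirname trick in option 2 works, here is a small trace of the intermediate values (using one hypothetical name):
s <- "X99_NORM"
chartr("_", "/", s)                            # "X99/NORM" - underscores become path separators
dirname(chartr("_", "/", s))                   # "X99"      - dirname drops the last "path" part
paste0("NORM_", dirname(chartr("_", "/", s)))  # "NORM_X99"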
x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
        grepl("NORM", x),
        sapply(strsplit(x[grepl("NORM", x)], "_"), function(x) {
          paste(rev(x), collapse = "_")
        }))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
  filter(str_detect(column, "NORM")) %>%
  mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
  select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30
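Note that the pipeline above keeps only the NORM rows. A variant that keeps every value and flips only the NORM ones (a sketch assuming the same my_data tibble) could look like this:
my_data %>%
  mutate(column = if_else(str_detect(column, "NORM"),
                          paste0("NORM_", str_extract(column, ".+(?=_)")),
                          column))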
I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and create a data frame in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]", names(data), value = TRUE),
                grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans about a day, I was wondering if anyone could help me write better, more compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
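As a quick sanity check, with a hypothetical names vector that includes the tricky r51 case:
x <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i", "r51x")
grep("r([5-9]|1[0-2])([^0-9]|$)", x, value = TRUE)
# [1] "r5e"  "r7g"  "r9i"  "r11k"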
Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can generalize the above code into a function. This function has three arguments: the first is the vector of variable names; the second and third are the limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf) {
  y <- unlist(strsplit(x, "[[:alpha:]]"))
  y <- as.integer(y[sapply(y, `!=`, "")])
  x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
Remove the non-digits, scan the remainder in, and check whether each number is in 5:12:
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
I have a data frame like the following:
sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+", "control_sdlkfjd_2632_CSF_CD8+", "control_sdlkfjd_2632_CSF")
values = rnorm(3, 8, 3)
df <- data.frame(sampleid, values)
I also have a vector like the following:
matches <- c("632_CSF_CD8+", "632_CSF")
I want to extract the rows of this data frame whose sampleid ends with one of the matches. From this example, you can see why the end of the string is important: I have two samples which contain "632_CSF", but they are distinct samples. If I instead changed matches to only:
matches <- c("632_CSF")
then I want only the third row of the data frame to be returned, because it is the only one where the match occurs at the end of the sampleid.
How can this be achieved?
Thanks!
Just use $ in your pattern to indicate that it occurs at the end of the string.
grep("632_CSF$", sampleid, value=TRUE)
[1] "control_sdlkfjd_2632_CSF"
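Applied to the data frame itself, a minimal sketch (note that a pattern containing a regex metacharacter, such as the + in 632_CSF_CD8+, would need escaping, which the next answer addresses):
df[grepl("632_CSF$", df$sampleid), ]   # returns only row 3 (control_sdlkfjd_2632_CSF)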
You can do this with stringr and some manipulation.
You first need to escape any regex metacharacters in the patterns; that is what the quotemeta function below does.
The next step is to append $ to anchor each match to the end of the string, and then to concatenate all the patterns into one with the regex OR operator, |.
The combined pattern is then used with str_detect to get boolean indices.
library(stringr)
# taken from here
# https://stackoverflow.com/a/14838753/1030110
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}
matches_with_end <- sapply(matches, function(x) { paste0(quotemeta(x), '$') })
joined_matches <- paste(matches_with_end, collapse = '|')
ind <- str_detect(df$sampleid, joined_matches)
# [1] FALSE TRUE TRUE
df[ind, ]
# sampleid values
# 2 control_sdlkfjd_2632_CSF_CD8+ 10.712634
# 3 control_sdlkfjd_2632_CSF 7.001628
I suggest making your dataset more regular.
library(tidyverse)
df_regular <- df %>%
  separate(
    sampleid,
    into = c("patient_type",
             "test_number",
             "patient_group",
             "patient_id"),
    extra = "merge") %>%
  mutate(patient_id = str_pad(patient_id, 9, side = "left", pad = "0"))
df_regular
df_regular %>%
  filter(patient_group %in% "2632" & patient_id %in% "000000CSF")
My dataframe which I read from a csv file has column names like this
abc.def, ewf.asd.fkl, qqit.vsf.addw.coil
I want to remove the '.' from all the names and convert them to
abcdef, eqfasdfkl, qqitvsfaddwcoil.
I tried using the sub command sub(".","",colnames(dataframe)) but this command took out the first letter of each column name and the column names changed to
bc.def, wf.asd.fkl, qit.vsf.addw.coil
Does anyone know another command to do this? I could change the column names one by one, but I have a lot of files with 30 or more columns in each file.
Again, I want to remove the "." from all the colnames. I am trying to do this so I can use "sqldf" commands, which don't deal well with "."
Thank you for your help
1) sqldf can deal with names having dots in them if you quote the names:
library(sqldf)
d0 <- read.csv(text = "A.B,C.D\n1,2")
sqldf('select "A.B", "C.D" from d0')
giving:
A.B C.D
1 1 2
2) When reading the data using read.table or read.csv use the check.names=FALSE argument.
Compare:
Lines <- "A B,C D
1,2
3,4"
read.csv(text = Lines)
## A.B C.D
## 1 1 2
## 2 3 4
read.csv(text = Lines, check.names = FALSE)
## A B C D
## 1 1 2
## 2 3 4
However, in this example it still leaves names that would have to be quoted in sqldf, since they have embedded spaces.
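For what it's worth, those space-containing names can still be used in sqldf as long as they are double-quoted; a sketch building on the Lines example above:
library(sqldf)
d1 <- read.csv(text = Lines, check.names = FALSE)
sqldf('select "A B", "C D" from d1')
##   A B C D
## 1   1   2
## 2   3   4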
3) To simply remove the periods, if DF is a data frame:
names(DF) <- gsub(".", "", names(DF), fixed = TRUE)
or it might be nicer to convert the periods to underscores so that it is reversible:
names(DF) <- gsub(".", "_", names(DF), fixed = TRUE)
This last line could be alternatively done like this:
names(DF) <- chartr(".", "_", names(DF))
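The underscore version is reversible in the sense that chartr can map the characters straight back (assuming the original names contained no underscores of their own):
names(DF) <- chartr(".", "_", names(DF))  # A.B -> A_B
names(DF) <- chartr("_", ".", names(DF))  # and back to A.B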
UPDATE dplyr 0.8.0
As of dplyr 0.8, funs() is soft-deprecated; use formula notation instead.
A dplyr way to do this using stringr:
library(dplyr)
library(stringr)
data <- data.frame(abc.def = 1, ewf.asd.fkl = 2, qqit.vsf.addw.coil = 3)
renamed_data <- data %>%
  rename_all(~ str_replace_all(., "\\.", "_"))  # note we have to escape the '.' character with \\
Make sure you install the packages with install.packages().
Remember that you have to escape the . character as \\. because in regex, which functions like str_replace_all use, . is a wildcard.
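To see the difference the escape makes, a tiny sketch:
library(stringr)
str_replace_all("abc.def", ".", "_")    # "_______"  every character is replaced
str_replace_all("abc.def", "\\.", "_")  # "abc_def"  only the literal dot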
To replace all the dots in the names you'll need to use gsub, rather than sub, which will only replace the first occurrence.
This should work.
test <- data.frame(abc.def = NA, ewf.asd.fkl = NA, qqit.vsf.addw.coil = NA)
names(test) <- gsub(".", "", names(test), fixed = TRUE)
test
abcdef ewfasdfkl qqitvsfaddwcoil
1 NA NA NA
You can also try the following (again, fixed = TRUE is required; otherwise . is treated as a regex wildcard and every character would be removed):
names(df) = gsub(pattern = ".", replacement = "", x = names(df), fixed = TRUE)