Flipping two sides of string - r

I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiar with R's colnames() function)
Now, what I want is simply to flip the values in front of, and after the underscore. e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.

Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach that might be more efficient, using chartr, dirname and paste. We first need the indices of the columns that contain "NORM".
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
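Since the question is about column names, the sub() option can be applied to names() directly. A minimal sketch, assuming a data frame called df (hypothetical, built here only for illustration) and a slightly looser pattern that accepts any prefix before the final _NORM:

```r
# Hypothetical data frame using the column names from the question
df <- data.frame(X99_NORM = 1, X101_NORM = 2,
                 X76_110_T02_09747 = 3, X30_NORM = 4)

# Swap "<prefix>_NORM" to "NORM_<prefix>";
# names that do not end in _NORM are returned unchanged by sub()
names(df) <- sub("^(.+)_(NORM)$", "\\2_\\1", names(df))
names(df)
```

Because sub() leaves non-matching strings alone, no separate index of the NORM columns is needed here.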

x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
grepl("NORM", x),
sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
paste(rev(x), collapse = "_")
}))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"

A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
filter(str_detect(column, "NORM")) %>%
mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30
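The pipeline above keeps only the matching rows. If the non-NORM names should stay in place, one possible variant (a sketch, not part of the original answer) is to let str_replace() do the swap, since it returns non-matching strings unchanged:

```r
library(stringr)

x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")

# Capture everything before the final "_NORM" and put NORM in front
flipped <- str_replace(x, "^(.+)_(NORM)$", "\\2_\\1")
flipped
```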

Related

How to insert a part of a value of a column into a new column

I have not been programming for that long and have now encountered a problem to which I have not yet been able to find a solution.
In my dataframe there is a column that contains several pieces of information. For example, one row looks like this:
sp|O94910|AGRL1_HUMAN
or like this
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN
Now I want to create a new column with the combination of digits between the two vertical bars.
For the upper example it would be O94910, for the lower Q13554; Q13555
I have already tried functions like str_extract_all, str_match or gsub. But nothing worked.
The "id" is the column I look at. It includes different combinations of digits. I need the one between the two |
> dput(head(anaDiff_PD_vs_CTRL$id, 10))
c("sp|O94910|AGRL1_HUMAN", "sp|P02763|A1AG1_HUMAN", "sp|P19652|A1AG2_HUMAN",
"sp|P25311|ZA2G_HUMAN", "sp|Q8NFZ8|CADM4_HUMAN", "sp|P08174|DAF_HUMAN",
"sp|Q15262|PTPRK_HUMAN", "sp|P78324|SHPS1_HUMAN;sp|Q5TFQ8|SIRBL_HUMAN;sp|Q9P1W8|SIRPG_HUMAN",
"sp|Q8N3J6|CADM2_HUMAN", "sp|P19021|AMD_HUMAN")
With dplyr and stringr you can try...
library(dplyr)
library(stringr)
dat %>%
rowwise() %>%
mutate(dig = str_extract_all(col, "(?<=sp\\|)[A-Z0-9]+(?=\\|)"),
dig = paste0(dig, collapse = "; "))
#> # A tibble: 4 x 2
#> # Rowwise:
#> col dig
#> <chr> <chr>
#> 1 sp|Q8NFZ8|CADM4_HUMAN Q8NFZ8
#> 2 sp|94910|AGRL1_HUMAN 94910
#> 3 sp|O94910|AGRL1_HUMAN O94910
#> 4 sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN Q13554; Q13555
data
dat <- data.frame(col = c("sp|Q8NFZ8|CADM4_HUMAN", "sp|94910|AGRL1_HUMAN", "sp|O94910|AGRL1_HUMAN", "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"))
Created on 2022-02-02 by the reprex package (v2.0.1)
Here is a solution without tidyverse:
dat <- read.table(text = "
sp|Q8NFZ8|CADM4_HUMAN
sp|94910|AGRL1_HUMAN
sp|O94910|AGRL1_HUMAN
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN")
ids <- strsplit(dat$V1, ";")
ids <- lapply(ids, function(x) gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", x))
ids <- lapply(ids, function(x) paste(x, collapse="; "))
dat$newcol <- unlist(ids)
Even with tidyverse, I would define a helper function for more clarity:
extract_ids <- function(x) {
ids <- strsplit(x, ";")
ids <- map(ids, ~ gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", .))
ids <- map(ids, ~ paste(., collapse="; "))
unlist(ids)
}
dat <- dat %>% mutate(ids = extract_ids(V1))
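For comparison, the same extraction can be written more compactly in base R with gregexpr() and regmatches(). This is only a sketch; the vector name id and the result name acc are illustrative, and the pattern assumes every entry has the form sp|ACCESSION|NAME:

```r
id <- c("sp|O94910|AGRL1_HUMAN",
        "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN")

# Find every run of non-"|" characters that sits between "sp|" and the next "|"
m <- regmatches(id, gregexpr("(?<=sp\\|)[^|]+(?=\\|)", id, perl = TRUE))

# Collapse the matches of each row into one "; "-separated string
acc <- vapply(m, paste, character(1), collapse = "; ")
acc
```

The result has one element per input row, so it can be assigned straight to a new column.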
This solution should help if you want to change your column names in a similar fashion:
library(tidyverse)
# create test data frame with column names "sp|O94910|AGRL1_HUMAN" and "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
col1 <- c(1,2,3,4,5)
col2 <- c(6,7,8,9,10)
df <- data.frame(col1, col2)
names(df)[1] <- "sp|O94910|AGRL1_HUMAN"
names(df)[2] <- "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
names <- as.data.frame((str_split(colnames(df), "\\|", simplify = TRUE))) # split the column names on "|" into separate columns
# remove all strings that contain less digits than letters or special characters
for(i in 1:nrow(names)) {
for(j in 1:ncol(names)){
if ( (str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[0-9]") >
str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[:alpha:]|[:punct:]") )){
names[i,j] <- names[i,j]
} else {
names[i,j] <- ""
}
}
}
# combine the list columns into a single column called "colnames"
names <- names %>% unite("colnames", 1:5, na.rm = TRUE, remove = TRUE, sep = ";")
# remove all ";" separators at the start of the strings, the end of the strings, and series of ";" into a single ";"
for (i in 1:nrow(names)){
names[i,] <- str_replace(names[i,],"\\;+$", "") %>%
str_replace("^\\;+", "") %>%
str_replace_all(";{2,}", ";")
}
# convert column with new names into a vector
new_names <- as.vector(names$colnames)
# replace old names with new names
names(df) <- new_names

How do I get the column number from a dataframe which contains specific strings?

I have a data frame df with 7 columns and a character vector z containing multiple strings.
I want a data frame containing only the columns in df which contain a string from z.
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
How do I get the column number of the z strings in df? Or how do I get a dataframe with only the columns which contains the z strings.
What I want is:
print(df)
"a_means" "c_m" "f_m"
What I tried:
match(a, names(df)
and
df[,which(colnames(df) %in% colnames(df[ ,grepl(z,names(df)])]
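You can use the following (this assumes the columns are actually named a_means, b_means and so on, and that every entry of z is a three-character prefix):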
df[,match(z, substring(colnames(df), 1, 3))]
With base R:
z <- paste(z, collapse = "|")
df[, grepl(z, names(df))] # you could use grep as well
Combine the search patterns and use that as a pattern for stringr::str_detect() function.
library(dplyr)
library(stringr)
df <- data.frame(a_means = "a_means",
b_means = "b_means",
c_means = "c_means",
d_means = "d_means",
e_means = "e_means",
f_means = "f_means",
g_means = "g_means"
)
z <- c("a_m","c_m","f_m")
z <- paste(z, collapse = "|")
df %>% select_if(str_detect(names(df), z))
#> a_means c_means f_means
#> 1 a_means c_means f_means
You can simply do this:
library(dplyr)
df %>%
select(contains(z))
Check out help("starts_with"). You can also match to a starting prefix with starts_with() among other things.
You can use select and matches to subset the columns based on z.
library(dplyr)
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
df %>%
select(matches(z))
#> X.a_means. X.c_means. X.f_means.
#> 1 a_means c_means f_means
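The question also asks for the column numbers. A base R sketch that returns both the positions and the subset, assuming (as in the stringr answer) that the columns are really named a_means, b_means, and so on, and that the entries of z are meant as name prefixes:

```r
# Hypothetical data frame with the intended column names
df <- data.frame(a_means = 1, b_means = 2, c_means = 3, d_means = 4,
                 e_mean = 5, f_means = 6, g_means = 7)
z <- c("a_m", "c_m", "f_m")

# TRUE for every column whose name starts with at least one entry of z
hits <- vapply(names(df), function(nm) any(startsWith(nm, z)), logical(1))

col_numbers <- which(hits)        # positions of the matching columns
subset_df <- df[, hits, drop = FALSE]
col_numbers
```

startsWith() compares literal prefixes, so the entries of z are not interpreted as regular expressions.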

Matching columns in 2 data frames when numbers don't exactly match

How do I match two different data frames when the values I am comparing are not exactly the same?
I was thinking of using merge() but I am not sure.
Table1:
ID Value.1
10001 x
18273-9 y
12824/5/6/7 z
10283/5/9 d
Table2:
ID Value.2
10001 a
18274 b
12826 c
10289 u
How do I merge Table 1 and 2 based on ID?
Which specific function of fuzzyjoin package would I use, especially with the "/" & "-" cases? How do I expand the "-" case from 18273-9 so that R will register 18273 / 18274 / 18275 / ...?
You can write a function to extract the corresponding sequences from the strings containing "/" or "-" and recombine them into a new data.frame as follows:
df1 <- data.frame(ID=c("10001","18273-9","15273-8", "15170-4", "12824/5/6/7","10283/5/9"),
value=c("a","c","c", "d","k", "l"), stringsAsFactors = F)
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"), stringsAsFactors = F)
doIt <- function(df){
listAsDF <- function(l) {
x <- stack(setNames(l, temp$value))
names(x) <- c("ID", "value")
return(x)
}
Base <- df[!grepl("\\/", df$ID) & !grepl("\\-", df$ID), ]
#1 cases when - present
temp <- df[grep("\\-", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "-"), function(e) seq(e[1], paste0(strtrim(e[1], nchar(e[1])-1), e[2]), 1)))
Base <- rbind(Base, temp)
#2 cases when / present
temp <- df[grep("\\/", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "/"), function(a) c(a[1], paste0(strtrim(a[1], nchar(a[1])-1), a[-1]))))
Base <- rbind(Base, temp)
return(Base)
}
Then you can merge df2 and df1:
merge(doIt(df1), df2, by = "ID", all.x = T)
Hope this helps!
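The same idea can be sketched with one small helper per ID, using the tables from the question. The helper name expand_id is hypothetical, and the sketch assumes that a "-" suffix replaces the last digits of the first number and that each "/" suffix replaces exactly its last digit:

```r
# Expand "18273-9" to 18273..18279 and "12824/5/6/7" to 12824, 12825, ...
expand_id <- function(id) {
  if (grepl("-", id, fixed = TRUE)) {
    parts <- strsplit(id, "-", fixed = TRUE)[[1]]
    stem <- substr(parts[1], 1, nchar(parts[1]) - nchar(parts[2]))
    as.character(seq(as.integer(parts[1]), as.integer(paste0(stem, parts[2]))))
  } else if (grepl("/", id, fixed = TRUE)) {
    parts <- strsplit(id, "/", fixed = TRUE)[[1]]
    stem <- substr(parts[1], 1, nchar(parts[1]) - 1)
    c(parts[1], paste0(stem, parts[-1]))
  } else {
    id
  }
}

Table1 <- data.frame(ID = c("10001", "18273-9", "12824/5/6/7", "10283/5/9"),
                     Value.1 = c("x", "y", "z", "d"), stringsAsFactors = FALSE)
Table2 <- data.frame(ID = c("10001", "18274", "12826", "10289"),
                     Value.2 = c("a", "b", "c", "u"), stringsAsFactors = FALSE)

# One row per expanded ID, repeating Value.1 as often as needed
expanded <- lapply(Table1$ID, expand_id)
long <- data.frame(ID = unlist(expanded),
                   Value.1 = rep(Table1$Value.1, lengths(expanded)),
                   stringsAsFactors = FALSE)

merged <- merge(long, Table2, by = "ID")
merged
```

Once every compound ID has been expanded to plain IDs, an ordinary merge() on exact matches is enough and no fuzzy matching is required.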
You could use the fuzzy string matching function "agrep" from base R.
df1 <- data.frame(ID=c("10001","18273-9","12824/5/6/7","10283/5/9"),
value=c("a","c","d","k"))
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"))
apply(df1, 1, function(x) agrep(x["ID"], df2$ID, max = 3.5))
As you see it struggles to find the match for row 4. So it might make sense to clean your ID variable (e.g., take out the "/") before running agrep.
One option could consist in extracting the format of ID you want to keep. And then do your merge.
You can format your ID column as follows:
library(stringr)
library(dplyr)
If you want only the digits before any symbols:
Table1 %>% mutate(ID = str_extract(ID, "[0-9]*"))
If you want to keep the first sequence of 5 digits:
Table1 %>% mutate(ID = str_extract(ID, "[0-9]{5}"))
This answers your second question, but does not use the fuzzyjoin package.

Applying a self-made formatting function over a data frame in R

I am using R and I need to format the numbers within a data frame, notably by fixing the number of digits before the decimal separator as well as after it. E.g. 3.56 must become "0003,56000".
So I built my own function:
format <- function(x, nbr_before_comma, nbr_after_comma){
x= round(x, nbr_after_comma)
x = toString(x)
l = strsplit(x, "[.]")[[1]]
#print(l)
#print(nchar(l[2]))
before_comma = paste0(strrep("0",nbr_before_comma - nchar(l[1])),l[1])
after_comma = ifelse(length(l) > 1,
paste0(l[2],strrep("0",nbr_after_comma - nchar(l[2]))),
strrep("0", nbr_after_comma))
res = paste0(before_comma, ",", after_comma)
return(res)
}
Trying this on a single number will work. Now I am trying to apply this to a dataframe. Let's take the toy example:
df <- data.frame("a" = c(2.5,3.56,4.5))
I define more precisely what I want:
format44 <- function(x){
return(format(x,4,4))
}
I have tried several possibilities:
df[] <- lapply(df, format44)
with dplyr:
df <- df %>%
mutate(a = format44(a))
and finally:
df["a"] <- lapply(df["a"],format44)
None of them works. In fact, I get the same output every time:
a
1 0002,5, 3
2 0002,5, 3
3 0002,5, 3
Any idea what the problem is ?
Use sprintf and then translate the decimal points to comma:
before <- after <- 4
fmt <- sprintf("%%0%d.%df", before + after + 1, after)
transform(df, a = chartr(".", ",", sprintf(fmt, a)))
giving:
a
1 0002,5000
2 0003,5600
3 0004,5000
or writing this with dplyr:
library(dplyr)
before <- after <- 4
df %>%
mutate(a = "%%0%d.%df" %>%
sprintf(before + after + 1, after) %>%
sprintf(a) %>%
chartr(".", ",", .))
giving:
a
1 0002,5000
2 0003,5600
3 0004,5000
In this case, mapply suits you better:
df$b <- mapply(format44, df$a)
You do not even need the format44 wrapper. You can use:
df$c <- mapply(format, df$a, 4,4)
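As for why every attempt in the question returns the same string: the likely culprit is toString(), which pastes a whole vector into one comma-separated string, so the function only ever formats that single collapsed value. A small check, using the column from the toy example:

```r
a <- c(2.5, 3.56, 4.5)

# toString() collapses the entire vector into a single string
collapsed <- toString(a)
collapsed

# Calling it once per element keeps one string per number
per_element <- vapply(a, toString, character(1))
per_element
```

This would explain why element-wise calls such as mapply() work, and why the vectorised sprintf() approach sidesteps the problem altogether.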

Selecting multiple columns using Regular Expressions

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and create a data frame in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans a day, I was wondering if anyone could help me write better, more compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
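To see the difference between the two patterns, they can be tried on a few names with grepl(). The vector below is hypothetical and adds the edge cases r51a and r120b, which are not in the question:

```r
# Hypothetical names, including the edge cases r51a and r120b
x <- c("r1a", "r5e", "r9i", "r11k", "r12", "r13g", "r51a", "r120b")

# Short pattern: only checks that the name contains r5..r9 or r10..r12
loose <- grepl("r([5-9]|1[0-2])", x)

# Longer pattern: additionally requires a non-digit or the end of the name next
strict <- grepl("r([5-9]|1[0-2])([^0-9]|$)", x)

x[loose]
x[strict]
```

The short pattern also accepts r51a and r120b, because it never checks what follows the number; the longer one rejects them.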
Suppose that in the code below x represents your names(data). Then the following keeps every name whose number is 5 or higher; add an upper bound such as y <= 12 to stop at r12.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can make a function with a generalization of the above code. This function has three arguments, the first is the vector of variables names, the second and the third are the limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
