split paired samples based on substring

split paired samples based on substring - r

I have two groups of paired samples that could be separated by the first two letters. I would like to make two groups based on the pairing using something like [tn][abc].
Example of paired samples:
nb-008 ta-008
na015 ta-015
data:
> colnames(data)
"nb-008" "nb-014" "na015" "na-018" "ta-008" "tc-014" "ta-015" "ta-018"
patient <- factor(sapply(str_split(colnames(data), '[tn][abc]'), function(x) x[[1]]))

We can create a grouping variable with sub. We match the pattern of 2 characters (..) from the beginning of the string (^) followed by - (if present), followed by one or more characters (.*) that we capture as a group (inside the brackets), and replace by the backreference (\\1). This can be used to split the column names.
split(colnames(data), sub('^..-?(.*)', '\\1', colnames(data))))
#$`008`
#[1] "nb-008" "ta-008"
#$`014`
#[1] "nb-014" "tc-014"
#$`015`
#[1] "na015" "ta-015"
#$`018`
#[1] "na-018" "ta-018"
data
v1 <- c("nb-008", "nb-014", "na015", "na-018",
"ta-008", "tc-014", "ta-015", "ta-018" )
set.seed(24)
data <- setNames(as.data.frame(matrix(sample(0:8, 8*5,
replace=TRUE), ncol=8)), v1)

Related

How to group files in a list based on name?

I have 4 files:
MCD18A1.A2001001.h15v05.061.2020097222704.hdf
MCD18A1.A2001001.h16v05.061.2020097221515.hdf
MCD18A1.A2001002.h15v05.061.2020079205554.hdf
MCD18A1.A2001002.h16v05.061.2020079205717.hdf
And I want to group them by name (date: A2001001 and A2001002) inside a list, something like this:
[[MCD18A1.A2001001.h15v05.061.2020097222704.hdf, MCD18A1.A2001001.h16v05.061.2020097221515.hdf], [MCD18A1.A2001002.h15v05.061.2020079205554.hdf, MCD18A1.A2001002.h16v05.061.2020079205717.hdf]]
I did this using Python, but I don't know how to do with R:
# Seperate files by date
MODIS_files_bydate = [list(i) for _, i in itertools.groupby(MODIS_files, lambda x: x.split('.')[1])]

Is this what you are looking for?
g <- sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", s)
split(s, g)
#$A2001001
#[1] "MCD18A1.A2001001.h15v05.061.2020097222704.hdf"
#[2] "MCD18A1.A2001001.h16v05.061.2020097221515.hdf"
#
#$A2001002
#[1] "MCD18A1.A2001002.h15v05.061.2020079205554.hdf"
#[2] "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
regex explained
The regex is divided in three parts.
^[^\\.]*\\.
^ first circumflex marks the beginning of the string;
^[^\\.] at the beginning, a class negating a dot (the second ^). The dot is a meta-character and, therefore, must be escaped, \\.;
the sequence with no dots at the beginning repeated zero or more times (*);
the previous sequence ends with a dot, \\..
([^\\.]+) is a capture group.
[^\\.] the class with no dots, like above;
[^\\.]+ repeated at least one time (+).
\\..*$"
\\. starting with one dot
\\..*$ any character repeated zero or more times until the end ($).
What sub is replacing is the capture group, what is between parenthesis, by itself, \\1. This discards everything else.
Data
s <- "
MCD18A1.A2001001.h15v05.061.2020097222704.hdf
MCD18A1.A2001001.h16v05.061.2020097221515.hdf
MCD18A1.A2001002.h15v05.061.2020079205554.hdf
MCD18A1.A2001002.h16v05.061.2020079205717.hdf"
s <- scan(text = s, what = character())

How would you like the outcome organized?
This is a solution:
files <- c("MCD18A1.A2001001.h15v05.061.2020097222704.hdf",
"MCD18A1.A2001001.h16v05.061.2020097221515.hdf",
"MCD18A1.A2001002.h15v05.061.2020079205554.hdf",
"MCD18A1.A2001002.h16v05.061.2020079205717.hdf")
unique_date <- unique(sub("^[^\\.]*\\.([^\\.]+)\\..*$", "\\1", files))
# (credit to Rui Barradas for the nice regular expression)
grouped_files <- lapply(unique_date, function(x){files[grepl(x, files)]})
names(grouped_files) <- unique_date
> grouped_files
# $A2001001
# [1] "MCD18A1.A2001001.h15v05.061.2020097222704.hdf" "MCD18A1.A2001001.h16v05.061.2020097221515.hdf"
# $A2001002
# [1] "MCD18A1.A2001002.h15v05.061.2020079205554.hdf" "MCD18A1.A2001002.h16v05.061.2020079205717.hdf"

Count the number of hyphens in a name using R

I have a data set with a certain amount of names. How can I count the number of names with at least one hyphen using R?

We can use str_count to get the number of hyphens and then count by creating a logical vector and get the sum
library(stringr)
sum(str_count(v1, "-") > 0)

In base R, we can use grepl
sum(grepl('-', df$Name))
Or with grep
length(grep('-', df$Name))
Using a reproducble example,
df <- data.frame(Name = c('name1-name2', 'name1name2',
'name1-name2-name3', 'name2name3'))
sum(grepl('-', df$Name))
#[1] 2
length(grep('-', df$Name))
#[1] 2

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I will first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.

This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we defined groups for collapsing, looking for characters that come after other characters to put them in the same group. Then we finally collapse all the values in the same group.

Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elemnts containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))

Subset a character vector in R based on a specific pattern

I have a vector of character ids, as rownames of a data frame in R. The rownames have the following pattern:
head(foo)
[1] "ENSG00000197372 (ZNF675)" "ENSG00000112624 (GLTSCR1L)"
[3] "ENSG00000151320 (AKAP6)" "ENSG00000139910 (NOVA1)"
[5] "ENSG00000137449 (CPEB2)" "ENSG00000004779 (NDUFAB1)"
I would like to somehow subset the above rownames (~700 entries) in order to keep only the gene symbols in the parenthesis part-i.e. ZNF675-and drop the rest part: is this possible through a function like gsub ?

We can use sub to match characters that are not (, then capture the characters inside the ( which is not a ) and replace it with the backreference (\\1) of the captured group
row.names(foo) <- sub("^[^(]+\\(([^)]+).*", "\\1", row.names(foo))
row.names(foo)
#[1] "ZNF675" "GLTSCR1L" "AKAP6" "NOVA1" "CPEB2" "NDUFAB1"
Or using str_extract from stringr
library(stringr)
str_extract(row.names(foo), "(?<=\\()[^)]+")
data
foo <- data.frame(col1 = rnorm(6))
row.names(foo) <- c("ENSG00000197372 (ZNF675)",
"ENSG00000112624 (GLTSCR1L)", "ENSG00000151320 (AKAP6)",
"ENSG00000139910 (NOVA1)",
"ENSG00000137449 (CPEB2)", "ENSG00000004779 (NDUFAB1)")

Replace element in vector based on first letter of character string

Consider the vectors below:
ID <- c("A1","B1","C1","A12","B2","C2","Av1")
names <- c("ALPHA","BRAVO","CHARLIE","AVOCADO")
I want to replace the first character of each element in vector ID with vector names based on the first letter of vector names. I also want to add a _0 before each number between 0:9.
Note that the elements Av1 and AVOCADO throw things off a bit, especially with the lowercase v in Av1.
The result should look like this:
res <- c("ALPHA_01","BRAVO_01","CHARLIE_01","ALPHA_12","BRAVO_02","CHARLIE_02", "AVOCADO_01")
I know it should be done with regex but I've been trying for 2 days now and haven't got anywhere.

We can use gsubfn.
library(gsubfn)
#remove the number part from 'ID' (using `sub`) and get the unique elements
nm1 <- unique(sub("\\d+", "", ID))
#using gsubfn, replace the non-numeric elements with the matching
#key/value pair in the replacement
#finally format to add the "_" with sub
sub("(\\d+)$", "_0\\1", gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"
The (\\d+) indicates one or more numeric elements, and (\\D+) is one or more non-numeric elements. We are wrapping it within the brackets to capture as a group and replace it with the backreference (\\1 - as it is the first backreference for the captured group).
Update
If the condition would be to append 0 only to those 'ID's that have numbers less than 10, then we can do this with a second gsubfn and sprintf
gsubfn("(\\d+)", ~sprintf("_%02d", as.numeric(x)),
gsubfn("(\\D+)", as.list(setNames(names, nm1)), ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_12"
#[5] "BRAVO_02" "CHARLIE_02" "AVOCADO_01"

Doing this via base R, we can search for second character being V (as in AVOCADO) and substring 2 characters if that's true or 1 character if not. This will capture both AVOCADO and ALPHA. We then match those substrings with the letters extracted from ID (also convert toupper to capture Av with AV). Finally paste _0 along with the number found in each ID
paste0(names[match(toupper(sub('\\d+', '', ID)),
ifelse(substr(names, 2, 2) == 'V', substr(names, 1, 2),
substr(names, 1, 1)))],'_0', sub('\\D+', '', ID))
#[1] "ALPHA_01" "BRAVO_01" "CHARLIE_01" "ALPHA_02" "BRAVO_02" "CHARLIE_02" "AVOCADO_01"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

split paired samples based on substring - r

Related

How to group files in a list based on name?

Count the number of hyphens in a name using R

find alphanumeric elements in vector

Subset a character vector in R based on a specific pattern

Replace element in vector based on first letter of character string

Categories

Resources