Splitting Character String then Pasting it together - r

I was manipulating my count-data (fcm) and had my Barcode ID's as column names in the format: TCGA.BH.A0DQ.11A.12R.A089.07 etc
I proceeded to use:
CountCol= colnames(fcm)
Barcode = strsplit(as.character(CountCol), ".", fixed=TRUE)
giving me a list of all the split character strings such as :
head(Barcode,2)
[[1]]
[1] "TCGA" "3C" "AAAU" "01A" "11R" "A41B" "07"
[[2]]
[1] "TCGA" "3C" "AALI" "01A" "11R" "A41B" "07"
My question is now how do I put only the first three elements together to make new column names separated by a "-" (i.e. TCGA-3C-AAAU for the first and so forth for the next ~1200 values)
I hope this was clear.
I tried a few methods but keep coming up short of the correct solution.

Try sapply:
sapply(Barcode,function(x){paste(x[1:3],collapse="-")})

You could also use the purrr library for simpler code (map returns a list; use map_chr with the same arguments to get a character vector):
library(purrr)
x <- c("TCGA", "3C", "AAAU", "01A", "11R", "A41B", "07" )
y <- c("TCGA", "3C", "AALI", "01A", "11R", "A41B", "07" )
z <- list(x, y)
purrr::map(z, ~paste(.[1:3], collapse = "-"))
[[1]]
[1] "TCGA-3C-AAAU"
[[2]]
[1] "TCGA-3C-AALI"
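The answers above can be tied back to the original column names with a short base R sketch. The two barcodes below are made-up stand-ins for colnames(fcm), and short_ids / short_ids_regex are hypothetical names; vapply is used instead of sapply only so that the result is guaranteed to be a character vector.

```r
# Stand-in for colnames(fcm); the real data has ~1200 barcodes.
CountCol <- c("TCGA.3C.AAAU.01A.11R.A41B.07",
              "TCGA.3C.AALI.01A.11R.A41B.07")

# Split on the literal dot, then paste the first three pieces with "-".
Barcode <- strsplit(CountCol, ".", fixed = TRUE)
short_ids <- vapply(Barcode,
                    function(x) paste(x[1:3], collapse = "-"),
                    character(1))

# Alternative without splitting: capture the first three dot-separated fields.
short_ids_regex <- sub("^([^.]+)\\.([^.]+)\\.([^.]+).*$", "\\1-\\2-\\3", CountCol)

short_ids
# [1] "TCGA-3C-AAAU" "TCGA-3C-AALI"
```

Assuming fcm is the count matrix from the question, colnames(fcm) <- short_ids would then apply the new names.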

Related

How to check if filename ends with a certain string? (R)

I have the following code:
for (fileName in fileNames) {
  index <- "0"
  if (grepl("_01", fileName, fixed = TRUE)) {
    index <- "01"
  }
  if (grepl("_02", fileName, fixed = TRUE)) {
    index <- "02"
  }
}
and so on.
My filename is like "31231_sad_01.csv" or "31231_happy_01.csv".
All of my filenames are stored in a character vector fileNames. I loop through each file.
How can I find the last part of the filename, i.e. 01 in this case, or 02?
I tried using the code I mentioned and it always returns 1 for every value.
Try the following:
library(stringr)
# suppose you have your file names in a character vector
fnames <- c("31231_sad_01.csv", "31231_happy_02.csv")
# extract every run of digits, then keep the second run from each name
unlist(lapply(str_extract_all(fnames,"\\d+"),'[',2))
It would return a vector
[1] "01" "02"
Vectorized alternatives exist, so there is no need for a loop.
To check if the last numeric part of filename ends with a specific number, here 01, we can first extract the numeric part, then run endsWith.
string <- c("31231_sad_01.csv", "bla_215.csv", "test_05.csv")
endsWith(stringr::str_extract(string, "([^_])*(?=\\.csv)"), "01")
#> [1] TRUE FALSE FALSE
An alternative way is to use sub to extract parts of the strings. Your examples show that the targeted index in each file name is always located after _ and before .csv. We can use this pattern in sub:
library(magrittr)
findex <- function(filename){
  filename %>%
    sub("\\.csv.*", "", .) %>% # extract the part before ".csv"
    sub(".*_", "", .)          # extract the part after the last "_"
}
This method works for an index of any length.
Test:
findex("31231_sad_01.csv")
#[1] "01"
findex("31231_happy_02.csv")
#[1] "02"
findex("31231_happy_213.csv")
#[1] "213"
findex("31231_happy_15213.csv")
#[1] "15213"
Then you can apply it to the vector that contains all the names with lapply or vapply:
names <- c("31231_happy_1032.csv", "31231_happy_02.csv", "31231_happy_213.csv", "31231_happy_15213.csv")
lapply(names, findex)
#[[1]]
#[1] "1032"
#[[2]]
#[1] "02"
#[[3]]
#[1] "213"
#[[4]]
#[1] "15213"
vapply(names, findex, character(1))
# 31231_happy_1032.csv    31231_happy_02.csv   31231_happy_213.csv
#               "1032"                  "02"                 "213"
#31231_happy_15213.csv
#              "15213"
In case you want to use only base R, this should work:
findex1 <- function(filename) sub(".*_" , "", sub("\\.csv.*" , "", filename))
vapply(names, findex1, character(1))
# 31231_happy_1032.csv 31231_happy_02.csv 31231_happy_213.csv
# "1032" "02" "213"
#31231_happy_15213.csv
# "15213"
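Since sub and endsWith are already vectorized, the whole vector of file names can be handled without a loop or an apply call. The following is a minimal base R sketch under the assumption that every name looks like "<something>_<index>.csv"; stem and index are hypothetical variable names.

```r
fileNames <- c("31231_sad_01.csv", "31231_happy_02.csv")

# Drop any directory part and the extension, leaving e.g. "31231_sad_01".
stem <- tools::file_path_sans_ext(basename(fileNames))

# Everything after the last underscore is the index.
index <- sub(".*_", "", stem)
index
# [1] "01" "02"

# Logical test for one specific ending.
endsWith(stem, "_01")
# [1]  TRUE FALSE
```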

How to insert back a character in a string at the exact position where it was originally

I have strings that have dots here and there and I would like to remove them - that is done, and after some other operations - these are also done, I would like to insert the dots back at their original place - this is not done. How could I do that?
library(stringr)
stringOriginal <- c("abc.def","ab.cd.ef","a.b.c.d")
dotIndex <- str_locate_all(pattern ='\\.', stringOriginal)
stringModified <- str_remove_all(stringOriginal, "\\.")
I see that str_sub() may help, for example str_sub(stringModified[2], 3,2) <- "." gets me somewhere, but it is still far from the right place, and also I have no idea how to do it programmatically. Thank you for your time!
update
stringOriginal <- c("11.123.100","11.123.200","1.123.1001")
stringOriginalF <- as.factor(stringOriginal)
dotIndex <- str_locate_all(pattern ='\\.', stringOriginal)
stringModified <- str_remove_all(stringOriginal, "\\.")
stringNumFac <- sort(as.numeric(stringModified))
stringi::stri_sub(stringNumFac[1:2], 3, 2) <- "."
stringi::stri_sub(stringNumFac[1:2], 7, 6) <- "."
stringi::stri_sub(stringNumFac[3], 2, 1) <- "."
stringi::stri_sub(stringNumFac[3], 6, 5) <- "."
factor(stringOriginal, levels = stringNumFac)
After such manipulation, I am able to order the numbers, convert them back to strings, and use them later for plotting.
But since I wouldn't know the position of the dots in advance, I want to do this programmatically. Another approach to factor ordering is also welcome, although I am still curious how to programmatically insert a character back into a string at the exact position where it was originally.
This might be one of the cases for using base R's strsplit, which gives you a list, with a vector of substrings for each entry in your original vector. You can manipulate these with lapply or sapply very easily.
split_string <- strsplit(stringOriginal, "[.]")
#> split_string
#> [[1]]
#> [1] "11" "123" "100"
#>
#> [[2]]
#> [1] "11" "123" "200"
#>
#> [[3]]
#> [1] "1" "123" "1001"
Now you can do this to get the numbers:
sapply(split_string, function(x) as.numeric(paste0(x, collapse = "")))
# [1] 11123100 11123200 11231001
And this to put the dots (or any replacement for the dots) back in:
sapply(split_string, paste, collapse = ".")
# [1] "11.123.100" "11.123.200" "1.123.1001"
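And you could get the location of the dots within each element of your original vector like this (the last value in each result is the string length plus one, not a dot position, so drop it if you only want the dots):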
lapply(split_string, function(x) cumsum(nchar(x) + 1))
# [[1]]
# [1] 3 7 11
#
# [[2]]
# [1] 3 7 11
#
# [[3]]
# [1] 2 6 11
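To answer the literal question of putting a character back at its recorded positions, one possible base R sketch is shown here. insert_at is a hypothetical helper, not part of any package, and it assumes the positions are 1-based, refer to the original string, and are given in increasing order.

```r
# Hypothetical helper: re-insert `char` into `s` at positions taken from the
# ORIGINAL string (1-based, increasing). Non-positive values (gregexpr's
# "no match" marker, -1) are ignored.
insert_at <- function(s, positions, char = ".") {
  positions <- positions[positions > 0]
  for (p in positions) {
    s <- paste0(substr(s, 1, p - 1), char, substr(s, p, nchar(s)))
  }
  s
}

stringOriginal <- c("11.123.100", "11.123.200", "1.123.1001")

# Record where the dots were, then remove them.
dot_pos <- lapply(gregexpr(".", stringOriginal, fixed = TRUE), as.integer)
stringModified <- gsub(".", "", stringOriginal, fixed = TRUE)

# Put the dots back, one string at a time.
restored <- mapply(insert_at, stringModified, dot_pos, USE.NAMES = FALSE)
restored
# [1] "11.123.100" "11.123.200" "1.123.1001"
```

Inserting from left to right works because each recorded position already counts the dots that precede it in the original string.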

Time conversion to number

I have a vector of time values. How can I remove the colons and convert each text value to a numeric value, i.e. from "10:01:02" (character) to 100102 (numeric)? All that I could find is presented below.
> x <- c("10:01:02", "11:01:02")
> strsplit(x, split = ":")
[[1]]
[1] "10" "01" "02"
[[2]]
[1] "11" "01" "02"
If you want to do everything in one line, you can use the destring() function from taRifx to remove everything that isn't a number and convert the result to numeric.
taRifx::destring(x)
This will also work if some of your data is formatted in a different way, such as "10-01-02", though you may have to set the value of keep.
destring("10-10-10", keep = "0-9")
And if you don't want to have to install the taRifx package you can define the destring() function locally.
destring <- function(x, keep = "0-9.-")
{
  return(as.numeric(gsub(paste("[^", keep, "]+", sep = ""),
                         "", x)))
}
We can use gsub to replace : with "". After that, use as.numeric to do the conversion.
x <- as.numeric(gsub(":", "", x, fixed = TRUE))
Or we can use the regex suggested by Soto
x <- as.numeric(gsub('\\D+', '', x))
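A short worked example of the gsub approach may help, including one caveat: a numeric value cannot keep a leading zero. The second time value below is made up for illustration, and x_num / x_padded are hypothetical names.

```r
x <- c("10:01:02", "09:05:00")

# Remove the colons, then convert to numeric.
x_num <- as.numeric(gsub(":", "", x, fixed = TRUE))
x_num
# [1] 100102  90500

# The leading zero of "09" is gone; pad back to six digits if a fixed-width
# string is needed later (e.g. for labels).
x_padded <- sprintf("%06d", as.integer(x_num))
x_padded
# [1] "100102" "090500"
```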
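Note that calling as.numeric(x) directly on strings such as "10:01:02" returns NA with a warning, because the colons make them invalid numbers. Strip the colons first, as in the gsub calls above, and then check the result:
class(x)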

union() function does not work properly when taking union of vector with other data type

I have a variable that is a vector containing row names. I want to take the union of this vector with the row names of another matrix, but when I do this it does not work properly. Basically, it puts everything together and does not remove the duplicates.
Here is my effort:
Step 1: put the names in a vector, which I read from a list of matrices:
name <- c()
name <- lapply(ismr0, function(x){
  name <- union(name, rownames(x))
  return(name)
})
> length(name)
[1] 733
>
Step 2, which does not work properly:
rn <- union(rownames(ismr0[[1]]), name)
> length(rn)
[1] 1180
>
> ismr0[[1]][1:4,]
             mature        RPM
MIMAT0000062 mature 49791.5560
MIMAT0000063 mature 92858.1285
MIMAT0000064 mature 10418.8532
MIMAT0000065 mature   404.7618
>
But I would expect the length to be 733, because the row names of ismr0[[1]] are a subset of the names in the name variable.
Could someone help me solve this problem?
As you guessed in the comments, you are using union on a character vector and a list. If you need all the unique row names from the list, try this example:
#dummy data
a<-matrix(1:4,ncol=1)
b<-matrix(1:4,ncol=1)
c<-matrix(1:4,ncol=1)
rownames(a) <- letters[c(2,3,5,7)]
rownames(b) <- letters[c(2,4,5,7)]
rownames(c) <- letters[c(2,3,6,7)]
ismr0 <-list(a,b,c)
#get unique names
name <- unique(unlist(lapply(ismr0,rownames)))
#check with union
rn <- union(rownames(ismr0[[1]]), name)
length(name)==length(rn)
You don't get what you expect because lapply returns a list. I ran an example of a list with 3 data.frames and it gave me :
[[1]]
[1] "l1" "l2" "l3" "l4" "l5" # first df rownames
[[2]]
[1] "l6" "l7" "l8" "l9" "l10" # second df rownames
[[3]]
[1] "l11" "l12" "l13" "l14" "l15" # third df rownames
which is a list.
Then, the line union(rownames(ismr0[[1]]), name) adds the elements of name to the list, which doesn't contain those single elements and you get something like :
[[1]]
[1] "l1" "l2" "l3" "l4" "l5"
[[2]]
[1] "l6" "l7" "l8" "l9" "l10"
[[3]]
[1] "l11" "l12" "l13" "l14" "l15"
[[4]]
[1] "l1"
[[5]]
[1] "l2"
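You need to flatten that list into a character vector before calling union, for example with unlist(lapply(ismr0, rownames)). sapply only returns a plain vector when every element has length one; with several row names per matrix it gives a matrix or a list.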
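Another way to build the combined name vector, sketched here with made-up matrices standing in for ismr0, is to fold union over the per-matrix row names with Reduce. The result is a plain character vector, so a later union with rownames(ismr0[[1]]) behaves as expected. all_names is a hypothetical variable name.

```r
# Dummy stand-ins for the matrices in ismr0.
a <- matrix(1:4, ncol = 1, dimnames = list(letters[c(2, 3, 5, 7)], NULL))
b <- matrix(1:4, ncol = 1, dimnames = list(letters[c(2, 4, 5, 7)], NULL))
ismr0 <- list(a, b)

# Fold union() across the row-name vectors; duplicates are dropped as we go.
all_names <- Reduce(union, lapply(ismr0, rownames))
all_names
# [1] "b" "c" "e" "g" "d"

# The row names of the first matrix are a subset, so the union adds nothing.
rn <- union(rownames(ismr0[[1]]), all_names)
length(rn) == length(all_names)
# [1] TRUE
```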

Extracting values after pattern

A beginner question...
I have a list like this:
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
(but many more lines)
I need to extract what is between "bb=" and ",". I.e. I want:
x21
g-25
26
Having read many similar questions here, I suppose it is stringr with str_extract I should use, but somehow I can't get it to work. Thanks for all help.
/Chris
strapply in the gsubfn package can do that. Note that [^,]* matches a string of non-commas.
strapply extracts the back referenced portion (the part within parentheses):
> library(gsubfn)
> strapply(x, "bb=([^,]*)", simplify = TRUE)
[1] "x21" "g-25" "26"
If there are several x vectors then provide them in a list like this:
> strapply(list(x, x), "bb=([^,]*)")
[[1]]
[1] "x21" "g-25" "26"
[[2]]
[1] "x21" "g-25" "26"
An option using regexpr:
> temp = regexpr('bb=[^,]*', x)
> substr(x, temp + 3, temp + attr(temp, 'match.length') - 1)
[1] "x21" "g-25" "26"
Here's one solution using the base regex functions in R. First we use strsplit to split on the comma. Then we use grepl to filter only the items that start with bb= and gsub to extract all the characters after bb=.
> x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")
> y <- unlist(strsplit(x , ","))
> unlist(lapply(y[grepl("bb=", y)], function(x) gsub("^.*bb=(.*)", "\\1", x)))
[1] "x21" "g-25" "26"
It looks like str_replace is the function you are after if you want to go that route:
> str_replace(y[grepl("bb=",y)], "^.*bb=(.*)", "\\1")
[1] "x21" "g-25" "26"
Read it in with commas as separators and take the second column:
> x.split <- read.table(textConnection(x), header=FALSE, sep=",", stringsAsFactors=FALSE)[[2]]
> x.split
[1] " bb=x21" " bb=g-25" " bb=26"
Then remove the "bb=" prefix along with the leading space:
> gsub("^ *bb=", "", x.split)
[1] "x21" "g-25" "26"
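For completeness, here are two base R one-liners that do not rely on the position of the bb= field. This is only a sketch and assumes each string contains exactly one bb= entry; via_sub and via_match are hypothetical names. The same lookbehind pattern should also work with the function the question mentions, as stringr::str_extract(x, "(?<=bb=)[^,]*").

```r
x <- c("aa=v12, bb=x21, cc=f35", "xx=r53, bb=g-25, yy=h48", "nn=u75, bb=26, gg=m98")

# Replace the whole string with the captured run of non-commas after "bb=".
via_sub <- sub(".*bb=([^,]*).*", "\\1", x)

# Or match only the value itself, using a Perl-style lookbehind.
via_match <- regmatches(x, regexpr("(?<=bb=)[^,]*", x, perl = TRUE))

via_sub
# [1] "x21"  "g-25" "26"
```

One difference to keep in mind: for a string without bb=, sub returns that string unchanged, whereas regmatches silently drops it from the result.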

Resources