I have the following code:
for (fileName in fileNames) {
index <- "0"
if (grepl("_01", fileName, fixed = TRUE)) {
index <- "01"
}
if (grepl("_02", fileName, fixed = TRUE)) {
index <- "02"
}
}
and so on.
My filename is like "31231_sad_01.csv" or "31231_happy_01.csv".
All of my filenames are stored in a character vector fileNames. I loop through each file.
How can I find the last part of the filename, i.e. the "01" or "02" in this case?
I tried using the code I mentioned and it always returns 1 for every value.
Try the following:
# suppose you have your file names in a character vector
library(stringr)
fnames <- c("31231_sad_01.csv", "31231_happy_02.csv")
# extract every run of digits from each name and keep the second one
unlist(lapply(str_extract_all(fnames, "\\d+"), '[', 2))
It would return a vector
[1] "01" "02"
Vectorized alternatives exist; there is no need for a loop.
To check whether the last numeric part of a filename ends with a specific number, here "01", we can first extract that part and then run endsWith.
string <- c("31231_sad_01.csv", "bla_215.csv", "test_05.csv")
endsWith(stringr::str_extract(string, "[^_]*(?=\\.csv)"), "01")
#> [1] TRUE FALSE FALSE
An alternative way is to use sub to extract parts of the strings. Your examples show that the targeted index in each file name is always located after _ and before .csv. We can use this pattern in sub:
library(magrittr)
findex <- function(filename){
  filename %>%
    sub("\\.csv.*", "", .) %>%  # extract the part before ".csv"
    sub(".*_", "", .)           # extract the part after the last "_"
}
This method works for indices of various lengths.
Test:
findex("31231_sad_01.csv")
#[1] "01"
findex("31231_happy_02.csv")
#[1] "02"
findex("31231_happy_213.csv")
#[1] "213"
findex("31231_happy_15213.csv")
#[1] "15213"
Then, you can apply lapply or vapply to the vector that contains all the names:
names <- c("31231_happy_1032.csv", "31231_happy_02.csv", "31231_happy_213.csv", "31231_happy_15213.csv")
lapply(names, findex)
#[[1]]
#[1] "1032"
#[[2]]
#[1] "02"
#[[3]]
#[1] "213"
#[[4]]
#[1] "15213"
vapply(names, findex, character(1))
#31231_happy_1032.csv   31231_happy_02.csv  31231_happy_213.csv
#             "1032"                 "02"                "213"
#31231_happy_15213.csv
#            "15213"
In case you want to use only base R, this should work:
findex1 <- function(filename) sub(".*_", "", sub("\\.csv.*", "", filename))
vapply(names, findex1, character(1))
# 31231_happy_1032.csv 31231_happy_02.csv 31231_happy_213.csv
# "1032" "02" "213"
#31231_happy_15213.csv
# "15213"
I have been trying to program a custom function that outputs the variable name as a string from an input object x that is a specific column of a data frame, i.e. in the form df$vector, so that it works like this:
function(iris$Species)
>"Species"
Currently I am doing this:
vector.name<-function(x){
require(stringr)
#convert df$variable into string
xname <- as.character(deparse(substitute(x)))
if (str_detect(xname,"$")==T) {
str_split(xname,"$")
}
}
but the results are unsatisfying
> vector.name(iris$Species)
[[1]]
[1] "iris$Species" ""
I have tried both strsplit(){base} and str_split(){stringr}; they both work normally for other ordinary alphabetic strings, e.g.
> str_split(as.character(deparse(substitute(iris$Species))),"S")
[[1]]
[1] "iris$" "pecies"
How do I extract "vector" from df$vector in a custom function then?
The $ is a regex metacharacter that matches the end of a string. Either escape it (\\$), wrap it inside square brackets ([$]), or use fixed to match the character literally:
vector.name <- function(x){
  xname <- as.character(deparse(substitute(x)))
  if(stringr::str_detect(xname, stringr::fixed("$"))) {
    stringr::str_split(xname, stringr::fixed("$"))
  }
}
Testing:
vector.name(iris$Species)
[[1]]
[1] "iris" "Species"
Note that the $ in the first str_detect returning TRUE is just a coincidence and nothing else: $ by itself looks for the end of the string, so it matches every string, whether it is blank or not.
> str_detect("iris$Species", "$")
[1] TRUE
> str_detect("", "$")
[1] TRUE
With the escaped version, it would instead be
> str_detect("iris$Species", "\\$")
[1] TRUE
> str_detect("", "\\$")
[1] FALSE
Similarly for str_split: since the unescaped $ matches the end of the string, it returns a blank second element.
> str_split("iris$Species", "$")
[[1]]
[1] "iris$Species" ""
Try this:
Get <- function(x) {
  x <- deparse(substitute(x))
  # drop everything up to and including the "$"
  gsub("^.+\\$", "", x)
}
Get(iris$Species)
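This returns just the column name:
# [1] "Species"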
I have column names similar to the following
names(df_woe)
# [1] "A_FLAG" "woe.ABCD.binned" "woe.EFGHIJ.binned"
...
I would like to rename the columns by removing the "woe." and ".binned" sections, so that the following will be returned
names(df_woe)
# [1] "A_FLAG" "ABCD" "EFGHIJ"
...
I have tried substr(names(df_woe), start, stop), but I am unsure how to set the start/stop arguments when they vary by column.
Another possible and readable approach is to use capture groups in the regex and return the group after the first dot and before the second, i.e.
gsub("(.*\\.)(.*)\\..+", "\\2", names(df_woe))
#[1] "A_FLAG" "ABCD" "EFGH"
nam <- c("A_FLAG", "woe.ABCD.binned", "woe.EFGH.binned")
gsub("woe\\.|\\.binned", "", nam)
[1] "A_FLAG" "ABCD" "EFGH"
EDIT: a solution that deals with weirder cases such as woe..binned.binned
gsub("^woe\\.|\\.binned$", "", nam)
Another solution, using the stringr package:
library(stringr)
str_replace_all("woe.ABCD.binned", pattern = "woe\\.|\\.binned", replacement = "")
# [1] "ABCD"
I have a vector of time values stored as character strings. How can I remove the colons and convert the values to numeric, i.e. turn the character "10:01:02" into the numeric 100102? All that I could come up with is presented below.
> x <- c("10:01:02", "11:01:02")
> strsplit(x, split = ":")
[[1]]
[1] "10" "01" "02"
[[2]]
[1] "11" "01" "02"
If you want to do everything in one line, you can use the destring() function from taRifx to remove everything that isn't a number and convert the result to numeric.
taRifx::destring(x)
This will also work if some of your data is formatted in a different way, such as "10-01-02", though you may have to adjust the value of keep.
destring("10-10-10", keep = "0-9")
And if you don't want to have to install the taRifx package you can define the destring() function locally.
destring <- function(x, keep = "0-9.-") {
  # drop every character not in `keep`, then convert to numeric
  as.numeric(gsub(paste("[^", keep, "]+", sep = ""), "", x))
}
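Applied to the question's vector, the local definition behaves the same way:
x <- c("10:01:02", "11:01:02")
destring(x)
# [1] 100102 110102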
We can use gsub to replace : with "". After that, use as.numeric to do the conversion.
x <- as.numeric(gsub(":", "", x, fixed = TRUE))
Or we can use the regex suggested by Soto:
x <- as.numeric(gsub('\\D+', '', x))
Try with
x <- as.numeric(x)
and then check the class to make sure:
class(x)
I was manipulating my count data (fcm) and had my barcode IDs as column names in the format TCGA.BH.A0DQ.11A.12R.A089.07, etc.
I proceeded to use:
CountCol= colnames(fcm)
Barcode = strsplit(as.character(CountCol), ".", fixed=TRUE)
giving me a list of all the split character strings such as :
head(Barcode,2)
[[1]]
[1] "TCGA" "3C" "AAAU" "01A" "11R" "A41B" "07"
[[2]]
[1] "TCGA" "3C" "AALI" "01A" "11R" "A41B" "07"
My question now is: how do I paste only the first three elements together to make new column names separated by "-" (i.e. TCGA-3C-AAAU for the first, and so forth for the next ~1200 values)?
I hope this was clear.
I tried a few methods but keep coming up short of the correct solution.
Try sapply:
sapply(Barcode,function(x){paste(x[1:3],collapse="-")})
You could also use the purrr library for more concise code:
library(purrr)
x <- c("TCGA", "3C", "AAAU", "01A", "11R", "A41B", "07" )
y <- c("TCGA", "3C", "AALI", "01A", "11R", "A41B", "07" )
z <- list(x, y)
purrr::map(z, ~paste(.[1:3], collapse = "-"))
[[1]]
[1] "TCGA-3C-AAAU"
[[2]]
[1] "TCGA-3C-AALI"
How can I grep all the gene names starting with "Gm" from data1[,7]?
I tried data2[grep("^Gm", data2$Genes),], but it extracts the entire row which starts with "Gm".
data1[,7] <-
[1] "Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5"
[2] "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5"
[3] "Arhgap15,Gm22867"
One option would be to split each string by "," with strsplit and then, using grep, extract the words in the output (which is a list, so lapply can be used) that begin with "Gm" (^ denotes the beginning of the string):
lapply(strsplit(Genes, ','), function(x) grep('^Gm', x, value=TRUE))
#[[1]]
#[1] "Gm23940"
#[[2]]
#[1] "Gm5852" "Gm5773" "Gm9116" "Gm9117"
#[[3]]
#[1] "Gm22867"
Or you could extract the words with stri_extract_all_regex from stringi:
library(stringi)
stri_extract_all_regex(Genes, 'Gm[[:alnum:]]+')
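For the example data this returns one vector of matches per input string:
#[[1]]
#[1] "Gm23940"
#[[2]]
#[1] "Gm5852" "Gm5773" "Gm9116" "Gm9117"
#[[3]]
#[1] "Gm22867"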
Or, if you need it as a vector, you can use unlist on the above output, or use gsub to remove the words that don't begin with "Gm" (\\b(?!Gm)\\w+\\b) along with the commas, and then use scan:
scan(text=gsub('\\b(?!Gm)\\w+\\b|,', ' ',
Genes, perl=TRUE), what='', quiet=TRUE)
#[1] "Gm23940" "Gm5852" "Gm5773" "Gm9116" "Gm9117" "Gm22867"
Update
If, instead, you need to remove all the words starting with Gm:
scan(text=gsub('\\bGm\\w+\\b|,', ' ', Genes, perl=TRUE),
what='', quiet=TRUE)
# [1] "Ighmbp2" "Mrpl21" "Cpt1a" "Mtl5" "Gal" "Ppp6r3"
# [7] "Lrp5" "Tdpoz4" "Tdpoz3" "Tdpoz5" "Arhgap15"
data
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
"Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
"Arhgap15,Gm22867")