trying to prune down a list of files - r

I have a list of files and I'm trying to extract all layer1_*.grd files. Is there a way of doing this in one grep expression?
lof <- c("layer1_1.grd", "layer1_1.gri", "layer1_2.grd", "layer1_2.gri",
"layer1_3.grd", "layer1_3.gri", "layer1_4.grd", "layer1_4.gri",
"layer1_5.grd", "layer1_5.gri", "layer2_1.grd", "layer2_1.gri",
"layer2_2.grd", "layer2_2.gri", "layer2_3.grd", "layer2_3.gri",
"layer2_4.grd", "layer2_4.gri", "layer2_5.grd", "layer2_5.gri",
"layer3_1.grd", "layer3_1.gri", "layer3_2.grd", "layer3_2.gri",
"layer3_3.grd", "layer3_3.gri", "layer3_4.grd", "layer3_4.gri",
"layer3_5.grd", "layer3_5.gri", "layer4_1.grd", "layer4_1.gri",
"layer4_2.grd", "layer4_2.gri", "layer4_3.grd", "layer4_3.gri",
"layer4_4.grd", "layer4_4.gri", "layer4_5.grd", "layer4_5.gri")
I tried doing this in two steps:
list.of.files <- list.files(pattern = c("1_"))
list.of.files <- list.of.files[grep(".grd", list.of.files)]
Can someone enlighten me how to do this with grep in one step? I naively tried passing list() and c() to the grep but, as you can imagine, it doesn't work.
list.of.files <- list.files()
list.of.files <- list.of.files[grep(list("1_", ".grd"), list.of.files)]

This should work for you:
> lof[grep("layer1_.*.grd", lof)]
[1] "layer1_1.grd" "layer1_2.grd" "layer1_3.grd" "layer1_4.grd" "layer1_5.grd"
Also, just to clarify your terminology: your list of files is not really a list; it's a character vector.

The stringr alternative is lof[str_detect(lof, "layer1_.*.grd")].
In fact, in this case you can be even more specific about the missing characters, so "layer1_[[:digit:]].grd" would work as the pattern here, and might be faster if lof is very long.

Related

paste specific text to strings that do not have it

I would like to paste "miR" to strings that do not have "miR" already, and skipping those that have it.
paste("miR", ....)
in
c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
out
c("miR-26b", "miR-26a", "miR-1297", "miR-4465", "miR-26b", "miR-26a")
One way could be by removing "miR" if it is present in the beginning of the string using sub and pasting it to every string irrespectively.
paste0("miR-", sub("^miR-","", x))
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
data
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
vec <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
sub("^(?!miR)(.*)$", "miR-\\1", vec, perl = T)
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
If you want to learn more:
type ?sub into R console
learn regex, have a closer look at negative look ahead, capturing groups LEARN REGEX
I've used perl = T because I get an error if I don't. READ MORE

Specify order of import for multiple tables in R

I'm trying to read in 360 data files in text format. I can do so using this code:
temp = list.files(pattern="*.txt")
myfiles = lapply(temp, read.table)
The problem I have is that the files are named as "DO_1, DO_2,...DO_360" and when I try to import the files into a list, they do not maintain this order. Instead I get DO_1, DO_10, etc. Is there a way to specify the order in which the files are imported and stored? I didn't see anything in the help pages for list.files or read.table. Any suggestions are greatly appreciated.
lapply will process the files in the order you have them stored in temp. So your goal is to sort them the way you actually think about them. Luckily there is the mixedsort function from the gtools package that does just the kind of sorting you're looking for. Here is a quick demo.
> library(gtools)
> vals <- paste("DO", 1:20, sep = "_")
> vals
[1] "DO_1" "DO_2" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7" "DO_8" "DO_9"
[10] "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17" "DO_18"
[19] "DO_19" "DO_20"
> vals <- sample(vals)
> sort(vals) # doesn't give us what we want
[1] "DO_1" "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17"
[10] "DO_18" "DO_19" "DO_2" "DO_20" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7"
[19] "DO_8" "DO_9"
> mixedsort(vals) # this is the sorting we're looking for.
[1] "DO_1" "DO_2" "DO_3" "DO_4" "DO_5" "DO_6" "DO_7" "DO_8" "DO_9"
[10] "DO_10" "DO_11" "DO_12" "DO_13" "DO_14" "DO_15" "DO_16" "DO_17" "DO_18"
[19] "DO_19" "DO_20"
So in your case you just want to do
library(gtools)
temp <- mixedsort(temp)
before your call to lapply that calls read.table.

incomplete list of csv file imported in R

I need to import a list of 36 csv files, but after running the code I get only 26 of them. Probably, 10 files have format problems. Is there a way in R to detect the 10 files that cannot be imported?
If you the file names in a list, you can use the following code:
all <- c("16048.txt", "16062.txt", "16066.txt", "16093.txt", "16095.txt", "16122.txt", "16241.txt", "16360.txt", "16380.txt", "16389.txt", "16510.txt", "16511.txt", "16701.txt", "16729.txt", "16735.txt", "16737.txt", "16761.txt", "16816.txt", "16867.txt", "16876.txt", "16880.txt", "16883.txt", "16884.txt", "16885.txt", "16893.txt", "16904.txt", "16906.txt", "16908.txt", "16929.txt", "16931.txt", "16938.txt", "16943.txt", "16959.txt", "16967.txt", "16968.txt", "16969.txt")
imp <- c("16761.txt", "16959.txt", "16884.txt", "16093.txt", "16883.txt", "16122.txt", "16906.txt", "16737.txt", "16968.txt", "16095.txt", "16062.txt", "16816.txt", "16360.txt", "16893.txt", "16885.txt", "16938.txt", "16048.txt", "16931.txt", "16876.txt", "16511.txt", "16969.txt", "16241.txt", "16967.txt", "16701.txt", "16380.txt", "16510.txt")
Where all is the list of filenames you need and imp is the imperfect result you got. You can get a list of the missing files with:
missing <- all[!all %in% imp]

How to Vectorize this R code Using Plyr, Apply, or Similar?

I wrote the following R code that identifies duplicate files in a directory. How can one vectorize the for-loop using the plyr package (or similar)? I would like to achieve a more idiomatic R solution than the one I came up with.
library("digest") # to compute the MD5 digest
test_dir = "/Users/user/Dropbox/kaggle/r_projects/test_photo"
filelist <- dir(test_dir, pattern = "JPG|AVI", recursive=TRUE,
all.files =TRUE, full.names=TRUE)
fl = list() #create and empty list to hold md5's and filenames
for (itm in filelist) {
file_digest = digest(itm, file=TRUE, algo="md5")
fl[[file_digest]]= c(fl[[file_digest]],itm)
}
fl
the output is ( using a small test directory):
> fl
$`5715b719723c5111b3a38a6ff8b7ca56`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480.JPG"
$`24fd4d7d252ca66c8d7a88b539c55112`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3481.JPG"
$`2a1d668c874dc856b9df0fbf3f2e81ec`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482.JPG"
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482 copy.JPG"
[4] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482.JPG"
I tried:
h=ldply(filelist, digest, file=TRUE, algo="md5")
h$filenames=filelist
but ended up with a unique row for every key value pair of (MD5, filename). I was not able to get the compact output desired.
(Background: As an exercise, I converted the python code presented by Raymond Hettinger in his PyCon AU 2011 keynote "What Makes Python Awesome". The slides are here: http://slidesha.re/WKkh9M . I was able to cut the LOC in half, but I think I can do better - and learn more - by vectorizing).
Here is a solution in base that is a little more concise:
md5s<-sapply(filelist,digest,file=TRUE,algo="md5")
split(filelist,md5s)
Here's one answer. First get the md5 and file names on to a data.frame with ldply. Then, create the list you desire with dlply.
fl <- ldply(seq_along(filelist), function(idx)
c(digest(filelist[idx], file=TRUE, algo="md5"),
filelist[idx]))
fl <- dlply(fl, .(V1), function(x) x$V2)

How to improve this code for getting pairwise?

It is a question build upon the previous question (http://stackoverflow.com/questions/6538448/r-how-to-write-a-loop-to-get-a-matrix).
It is different from the previous one, as more details is provided, and libraries and example file is provided according to comments from DWin. So, I submitted it as a new question. Could you mind to teach me how to modify this code further?
To load the necessary libraries:
source("http://bioconductor.org/biocLite.R")
biocLite()
My protseq.fasta file has the following contents:
>drugbank_target|1 Peptidoglycan synthetase ftsI (DB00303)
MVKFNSSRKSGKSKKTIRKLTAPETVKQNKPQKVFEKCFMRGRYMLSTVLILLGLCALVARAAYVQSINADTLSNEADKR
SLRKDEVLSVRGSILDRNGQLLSVSVPMSAIVADPKTMLKENSLADKERIAALAEELGMTENDLVKKIEKNSKSGYLYLA
RQVELSKANYIRRLKIKGIILETEHRRFYPRVEEAAHVVGYTDIDGNGIEGIEKSFNSLLVGKDGSRTVRKDKRGNIVAH
ISDEKKYDAQDVTLSIDEKLQSMVYREIKKAVSENNAESGTAVLVDVRTGEVLAMATAPSYNPNNRVGVKSELMRNRAIT
DTFEPGSTVKPFVVLTALQRGVVKRDEIIDTTSFKLSGKEIVDVAPRAQQTLDEILMNSSNRGVSRLALRMPPSALMETY
QNAGLSKPTDLGLIGEQVGILNANRKRWADIERATVAYGYGITATPLQIARAYATLGSFGVYRPLSITKVDPPVIGKRVF
SEKITKDIVGILEKVAIKNKRAMVEGYRVGVKTGTARKIENGHYVNKYVAFTAGIAPISDPRYALVVLINDPKAGEYYGG
AVSAPVFSNIMGYALRANAIPQDAEAAENTTTKSAKRIVYIGEHKNQKVN
>drugbank_target|3 Histidine decarboxylase (DB00114; DB00117)
MMEPEEYRERGREMVDYICQYLSTVRERRVTPDVQPGYLRAQLPESAPEDPDSWDSIFGDIERIIMPGVVHWQSPHMHAY
YPALTSWPSLLGDMLADAINCLGFTWASSPACTELEMNVMDWLAKMLGLPEHFLHHHPSSQGGGVLQSTVSESTLIALLA
ARKNKILEMKTSEPDADESCLNARLVAYASDQAHSSVEKAGLISLVKMKFLPVDDNFSLRGEALQKAIEEDKQRGLVPVF
VCATLGTTGVCAFDCLSELGPICAREGLWLHIDAAYAGTAFLCPEFRGFLKGIEYADSFTFNPSKWMMVHFDCTGFWVKD
KYKLQQTFSVNPIYLRHANSGVATDFMHWQIPLSRRFRSVKLWFVIRSFGVKNLQAHVRHGTEMAKYFESLVRNDPSFEI
PAKRHLGLVVFRLKGPNCLTENVLKEIAKAGRLFLIPATIQDKLIIRFTVTSQFTTRDDILRDWNLIRDAATLILSQHCT
SQPSPRVGNLISQIRGARAWACGTSLQSVSGAGDDPVQARKIIKQPQRVGAGPMKRENGLHLETLLDPVDDCFSEEAPDA
TKHKLSSFLFSYLSVQTKKKTVRSLSCNSVPVSAQKPLPTEASVKNGGSSRVRIFSRFPEDMMMLKKSAFKKLIKFYSVP
SFPECSSQCGLQLPCCPLQAMV
>drugbank_target|5 Glutaminase liver isoform, mitochondrial (DB00130; DB00142)
MRSMKALQKALSRAGSHCGRGGWGHPSRSPLLGGGVRHHLSEAAAQGRETPHSHQPQHQDHDSSESGMLSRLGDLLFYTI
AEGQERTPIHKFTTALKATGLQTSDPRLRDCMSEMHRVVQESSSGGLLDRDLFRKCVSSSIVLLTQAFRKKFVIPDFEEF
TGHVDRIFEDVKELTGGKVAAYIPQLAKSNPDLWGVSLCTVDGQRHSVGHTKIPFCLQSCVKPLTYAISISTLGTDYVHK
FVGKEPSGLRYNKLSLDEEGIPHNPMVNAGAIVVSSLIKMDCNKAEKFDFVLQYLNKMAGNEYMGFSNATFQSEKETGDR
NYAIGYYHEEKKCFPKGVDMMAALDLYFQLCSVEVTCESGSVMAATLANGGICPITGESVLSAEAVRNTLSLMHSCGMYD
FSGQFAFHVGLPAKSAVSGAILLVVPNVMGMMCLSPPLDKLGNSHRGTSFCQKLVSLFNFHNYDNLRHCARKLDPRREGA
EIRNKTVVNLLFAAYSGDVSALRRFALSAMDMEQKDYDSRTALHVAAAEGHIEVVKFLIEACKVNPFAKDRWGNIPLDDA
VQFNHLEVVKLLQDYQDSYTLSETQAEAAAEALSKENLESMV
>drugbank_target|6 Coagulation factor XIII A chain (DB00130; DB01839; DB02340)
SETSRTAFGGRRAVPPNNSNAAEDDLPTVELQGVVPRGVNLQEFLNVTSVHLFKERWDTNKVDHHTDKYENNKLIVRRGQ
SFYVQIDFSRPYDPRRDLFRVEYVIGRYPQENKGTYIPVPIVSELQSGKWGAKIVMREDRSVRLSIQSSPKCIVGKFRMY
VAVWTPYGVLRTSRNPETDTYILFNPWCEDDAVYLDNEKEREEYVLNDIGVIFYGEVNDIKTRSWSYGQFEDGILDTCLY
VMDRAQMDLSGRGNPIKVSRVGSAMVNAKDDEGVLVGSWDNIYAYGVPPSAWTGSVDILLEYRSSENPVRYGQCWVFAGV
FNTFLRCLGIPARIVTNYFSAHDNDANLQMDIFLEEDGNVNSKLTKDSVWNYHCWNEAWMTRPDLPVGFGGWQAVDSTPQ
ENSDGMYRCGPASVQAIKHGHVCFQFDAPFVFAEVNSDLIYITAKKDGTHVVENVDATHIGKLIVTKQIGGDGMMDITDT
YKFQEGQEEERLALETALMYGAKKPLNTEGVMKSRSNVDMDFEVENAVLGKDFKLSITFRNNSHNRYTITAYLSANITFY
TGVPKAEFKKETFDVTLEPLSFKKEAVLIQAGEYMGQLLEQASLHFFVTARINETRDVLAKQKSTVLTIPEIIIKVRGTQ
VVGSDMTVTVQFTNPLKETLRNVWVHLDGPGVTRPMKKMFREIRPNSTVQWEEVCRPWVSGHRKLIASMSSDSLRHVYGE
LDVQIQRRPSM
To load the data to R for the analysis, I have done:
require("Biostrings")
data(BLOSUM100)
seqs <- readFASTA("./protseq.fasta", strip.descs=TRUE)
To get the the pairwise numbers, as there are a total of 4 sequences, I have done:
number <-c(1:4); dat <- expand.grid(number,number, stringsAsFactors=FALSE)
datr <- dat[dat[,1] > dat[,2] , ]
In order to calculate the score one by one, I can do this:
score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))
However, I have problem to add a new column as "score" to include all the score for each pairs of the proteins. I tried to do this, but did not work.
datr$score <- lapply(datr, 1, function(i) { x <- datr[i,1]; y<- datr[i,2]; score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))})
Could you mind to comments how to further improve it? Thanks DWin and diliop for wonderful solutions to my previous question.
Try:
datr$score <- sapply(1:nrow(datr), function(i) {
x <- datr[i,1]
y <- datr[i,2]
score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100,gapOpening=0, gapExtension=-5))
})
To be able to reference your sequences better using their names, you might want to tidy up datr by doing the following:
colnames(datr) <- c("seq1id", "seq2id", "score")
datr$seq1name <- sapply(datr$seq1id, function(i) seqs[[i]]$desc)
datr$seq2name <- sapply(datr$seq2id, function(i) seqs[[i]]$desc)
Or if you just want to extract the accession IDs i.e. the contents of your parentheses, you could use stringr as such:
library(stringr)
datr$seq1name <- sapply(datr$seq2id, function(i) str_extract(seqs[[i]]$desc, "DB[0-9\\ ;DB]+"))
Hope this helps!

Resources