Selective listing (ls) of files

I have a list of files like:
(in general, it's of the format /data/logs/var_YYYYMMDDHHMMSS.log)
/data/logs/var_20121002103531.log
/data/logs/var_20121002104215.log
/data/logs/var_20121002120001.log
I've tried in vain to list all the files which DO NOT have 12 as HH.
i.e. input:
/data/logs/var_20121002103531.log
/data/logs/var_20121002104215.log
/data/logs/var_20121002120001.log # THIS IS TO BE OMITTED
/data/logs/var_20121002140001.log
Required Output:
/data/logs/var_20121002103531.log
/data/logs/var_20121002104215.log
/data/logs/var_20121002140001.log
I tried
ls -1tr /data/logs/var_20121002^[12]
ls -1tr /data/logs/var_20121002![11]
but I'm not getting what I expect.
Can someone please pinpoint where I'm going wrong?
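For reference, one approach that does work is to list everything and filter with grep -v, since ordinary shell globs have no `^` or `!` negation operator. This is a sketch: it reproduces the question's file names under a hypothetical /tmp/logs_demo directory.

```shell
# recreate the sample file names in a scratch directory
mkdir -p /tmp/logs_demo
touch /tmp/logs_demo/var_20121002103531.log \
      /tmp/logs_demo/var_20121002104215.log \
      /tmp/logs_demo/var_20121002120001.log \
      /tmp/logs_demo/var_20121002140001.log

# The HH field sits at a fixed offset after the var_YYYYMMDD prefix,
# so anchoring grep -v on the prefix plus "12" drops exactly those files:
ls -1 /tmp/logs_demo/var_*.log | grep -v '/var_2012100212'
```

If you prefer to stay inside the shell's globbing, bash's `shopt -s extglob` enables negation patterns such as `var_20121002!(12*).log`; zsh has a similar `extendedglob` option.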

Related

define SAMPLE for different dir name and sample name in snakemake code

I have written a snakemake workflow to run bwa_map. The fastq files live in different folders and have different sample names (paired end). It fails with an error that 'SAMPLES' is not defined. Please help.
Error:
$ snakemake --snakefile rnaseq.smk mapped_reads/EZ-123-B_IGO_08138_J_2_S101_R2_001.bam -np
NameError in line 2 of /Users/singhh5/Desktop/tutorial/rnaseq.smk:
name 'SAMPLES' is not defined
File "/Users/singhh5/Desktop/tutorial/rnaseq.smk", line 2, in <module>
#SAMPLE DIRECTORY
fastq
Sample_EZ-123-B_IGO_08138_J_2
EZ-123-B_IGO_08138_J_2_S101_R1_001.fastq.gz
EZ-123-B_IGO_08138_J_2_S101_R2_001.fastq.gz
Sample_EZ-123-B_IGO_08138_J_4
EZ-124-B_IGO_08138_J_4_S29_R1_001.fastq.gz
EZ-124-B_IGO_08138_J_4_S29_R2_001.fastq.gz
#My Code
expand("~/Desktop/{sample}/{rep}.fastq.gz", sample=SAMPLES)

rule bwa_map:
    input:
        "data/genome.fa",
        "fastq/{sample}/{rep}.fastq"
    conda:
        "env.yaml"
    output:
        "mapped_reads/{rep}.bam"
    threads: 8
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
The specific error you are seeing is because the variable SAMPLES isn't set to anything before you use it in expand.
Some other issues you may run into:
The output file is missing the {sample} wildcard.
The value of threads isn't passed to bwa or samtools.
Your expand belongs in the input directive of the first rule in your snakefile, typically called all, to properly request the files from bwa_map.
You aren't pairing your reads (R1 and R2) in bwa.
You should look around stackoverflow or some github projects for similar rules to give you inspiration on how to do this mapping.
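Putting those points together, one possible shape for the Snakefile is sketched below. The SAMPLES dict and the path patterns are assumptions read off the directory listing above (a Sample_<name> folder containing <name>_<Sxx>_R1/R2_001.fastq.gz files); verify them against your real layout before use.

```
# Sketch only: map each sample name to its file prefix
# (names guessed from the listing above -- adjust to your data)
SAMPLES = {
    "EZ-123-B_IGO_08138_J_2": "EZ-123-B_IGO_08138_J_2_S101",
    "EZ-123-B_IGO_08138_J_4": "EZ-124-B_IGO_08138_J_4_S29",
}

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=SAMPLES)

rule bwa_map:
    input:
        ref="data/genome.fa",
        r1=lambda wc: "fastq/Sample_{s}/{p}_R1_001.fastq.gz".format(
            s=wc.sample, p=SAMPLES[wc.sample]),
        r2=lambda wc: "fastq/Sample_{s}/{p}_R2_001.fastq.gz".format(
            s=wc.sample, p=SAMPLES[wc.sample]),
    output:
        "mapped_reads/{sample}.bam"
    conda:
        "env.yaml"
    threads: 8
    shell:
        "bwa mem -t {threads} {input.ref} {input.r1} {input.r2} "
        "| samtools view -Sb - > {output}"
```

Here the output carries the {sample} wildcard, both mates are passed to bwa, {threads} is forwarded, and rule all requests every bam so snakemake knows what to build.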

Snakemake: how to use one integer from list each call as input to script?

I'm trying to practice writing workflows in snakemake.
The contents of my Snakefile:
configfile: "config.yaml"

rule get_col:
    input:
        expand("data/{file}.csv", file=config["datname"])
    output:
        expand("output/{file}_col{param}.csv", file=config["datname"], param=config["cols"])
    params:
        col=config["cols"]
    script:
        "scripts/getCols.R"
The contents of config.yaml:
cols:
[2,4]
datname:
"GSE3790_expression_data"
My R script:
getCols = function(input, output, col) {
  dat = read.csv(input)
  dat = dat[, col]
  write.csv(dat, output, row.names=FALSE)
}
getCols(snakemake@input[[1]], snakemake@output[[1]], snakemake@params[['col']])
It seems like both columns are being called at once. What I'm trying to accomplish is one column being called from the list per output file.
Since the second output never gets a chance to be created (both columns are used to create first output), snakemake throws an error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 3 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Job completed successfully, but some output files are missing.
On a slightly unrelated note, I thought I could write the input as:
'"data/{file}.csv"'
But that returns:
WildcardError in line 4 of /Users/rebecca/Desktop/snakemake-tutorial/practice/Snakefile:
Wildcards in input files cannot be determined from output files:
'file'
Any help would be much appreciated!
Looks like you want to run your Rscript twice per file, once for every value of col. In this case, the rule needs to be called twice as well.
The use of expand is also a bit too much here, in my opinion. expand fills your wildcards with all possible values and returns a list of the resulting files, so the output of this rule would be all possible combinations of files and cols, which the simple script cannot create in one run.
This is also the reason why file cannot be inferred from the output - it gets expanded there.
Instead, write the rule for just one file and one column, and expand on the resulting output in a rule which needs those files as input. If this is the final output of your workflow, put the expand in the input of a rule all to tell the workflow what the ultimate goal is.
rule all:
    input:
        expand("output/{file}_col{param}.csv",
               file=config["datname"], param=config["cols"])

rule get_col:
    input:
        "data/{file}.csv"
    output:
        "output/{file}_col{param}.csv"
    params:
        col=lambda wc: wc.param
    script:
        "scripts/getCols.R"
Snakemake will infer from rule all (or any other rule to further use the output) what needs to be done and will call the rule get_col accordingly.

Is there a way to check where R is 'stuck' within a for loop? (R)

I am using system() to run several files iteratively through a program via CMD. It deposits each output into a sub-directory designated specifically and only for that input file, so the number of inputs exactly equals the number of output directories/outputs.
My code works for the first iteration, but I can see in the console that it won't move on to the second file after completing the first. The stop sign remains active, so I know R is still running, but since the for-loop environment is unique I can't really tell what it's stuck on. It just stays like this for hours, so I'm not sure how to begin to diagnose the issue. Is there a way of tracing what happened after cancelling the code, for example?
If you're curious, the code looks like this (I don't know how to make it reproducible, so I've commented each line):
for (i in 1:length(flist)) {
  ## flist is a vector of character strings; each entry is both the name
  ## of the input file and the name of the output directory
  setwd(paste0(solutions_dir, "\\", flist[i]))  # set the appropriate dir
  ## invoke the program's exe with the appropriate input/output locations
  system(paste0(program_dir, "\\program.exe I=",
                file_dir, "\\", flist[i], " O=", solutions_dir, "\\", flist[i],
                "\\solv"))
}

Unix remove old files based on date from file name

I have filenames in a directory like:
ACCT_GA12345_2015-01-10.xml
ACCT_GA12345_2015-01-09.xml
ACCT_GDC789g_2015-01-09.xml
ACCT_GDC567g_2015-01-09.xml
ACCT_GDC567g_2015-01-08.xml
ACCT_GCC7894_2015-01-01.xml
ACCT_GCC7894_2015-01-02.xml
ACCT_GAC7884_2015-02-01.xml
ACCT_GAC7884_2015-01-01.xml
I want to keep only the latest file for each account in the folder. The latest file can be determined from the file name alone (NOT the file's timestamp). For example, ACCT 12345 has files from 01-10 and 01-09, so the 01-09 file should be deleted and only the 01-10 file kept; ACCT 789g has only one file, so that file stays; and for ACCT 567g the latest is 01-09, so 01-08 goes and 01-09 stays. In other words, the file to keep is identified by the ACCT plus the maximum date for that ACCT.
I would need the final list of files as:
ACCT_GA12345_2015-01-10.xml
ACCT_GDC789g_2015-01-09.xml
ACCT_GDC567g_2015-01-09.xml
ACCT_GCC7894_2015-01-02.xml
ACCT_GAC7884_2015-02-01.xml
Can someone help me with this command in Unix? Any help is appreciated.
I'd do something like this: to test, start with the ls command; when it lists exactly what you want to delete, switch to rm.
ls ACCT_{GDC,GA1}*-{09,10}.xml
This lists any GDC or GA1 files whose names end in 09 or 10. Play with combinations and different values until the set of files showing is exactly the set you want deleted. Once it is, just change ls to rm and you should be golden.
With some more info I could help you out. To test this out I did:
touch ACCT_{GDC,GA1}_{01..10}_{05..10}.xml
which makes 120 dummy files with different name combinations. Make a directory, run this command, and get your hands dirty; that is the best way to learn the Linux CLI. Many commands you will learn, understand, use, and then never use again, so it pays to teach yourself to read man pages and to set up a scratch spot to play around in.
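For the actual "keep only the newest per account" step, here is one sketch using sort and awk. It assumes the ISO dates in the names sort the same lexically and chronologically (they do for YYYY-MM-DD), and it rebuilds the question's files in a hypothetical /tmp/acct_demo directory first.

```shell
# scratch directory with the file names from the question
mkdir -p /tmp/acct_demo && cd /tmp/acct_demo
touch ACCT_GA12345_2015-01-10.xml ACCT_GA12345_2015-01-09.xml \
      ACCT_GDC789g_2015-01-09.xml ACCT_GDC567g_2015-01-09.xml \
      ACCT_GDC567g_2015-01-08.xml ACCT_GCC7894_2015-01-01.xml \
      ACCT_GCC7894_2015-01-02.xml ACCT_GAC7884_2015-02-01.xml \
      ACCT_GAC7884_2015-01-01.xml

# Splitting on "_", field 2 is the account and field 3 starts with the
# ISO date.  Sort by account, newest date first; awk's seen[$2]++ then
# prints every line AFTER the first per account -- i.e. the older files,
# which are handed to rm.  (Run without "| xargs rm" first to preview.)
ls ACCT_*.xml | sort -t_ -k2,2 -k3,3r | awk -F_ 'seen[$2]++' | xargs rm

# what survives: the newest file per account
ls ACCT_*.xml
```

Changing `seen[$2]++` to `!seen[$2]++` prints the keep-list instead of the delete-list, which is a handy way to double-check before deleting anything.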

UNIX - Finding all empty files in source directory & finding files edited X days ago

The following is from a school assignment.
The question asks
"What command enables you to find all empty files in your source
directory."
My answer: find -size 0. However, my instructor says that my answer is incorrect. The only hint he gives (regarding the entirety of the assignment) is "...minor errors such as missing a file name or outputting too much information". I was thinking that perhaps I should include the source directory in my find command.
I've been trying to figure this out for the past few hours. I've referenced my textbook and according to that I should be correct.
There's some other questions I'm having similar issues with. I've wracked my brain with this for hours. I just don't know. I don't understand what I'm doing wrong.
Since your assignment was to find all empty files in your source directory, the following command will do exactly what you want:
find . -size 0
Notice the dot (.) telling the command to search in the current folder.
For other folders, replace the dot with the directory you want to search.
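As a refinement (a sketch, beyond what the assignment strictly asks): adding -type f restricts the match to regular files, and find also offers -empty, which is equivalent to -size 0 for files. The demo below uses a hypothetical /tmp/src_demo directory standing in for the source directory.

```shell
# hypothetical "source" directory: one empty file, one non-empty file
mkdir -p /tmp/src_demo
touch /tmp/src_demo/notes.txt                            # empty
echo 'int main(void){return 0;}' > /tmp/src_demo/main.c  # non-empty

# -type f skips directories and other non-files; -empty matches
# zero-length regular files
find /tmp/src_demo -type f -empty
```

Without -type f, an empty directory could also show up in the results, which may be the "outputting too much information" the instructor hinted at.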
