Snakemake: how to use glob_wildcards properly if one wildcard appears multiple times in the path?

I hope to use glob_wildcards for my two wildcards, "sample" and "ID",
but in the path: config["path"] + "{sample}/map/filtered.{sample}.R1.clean.fq.gz.split/filtered.{sample}.R1.clean.id_{ID}.fq.gz"
{sample} appears multiple times. What can I do?
The other issue I have is that each of my samples has a different number of IDs. How can I write the Snakefile so that it ignores certain combinations of sample and ID?
Thank you very much!
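As far as I know, Snakemake compiles the pattern into a regex in which later occurrences of {sample} become backreferences, so a repeated wildcard is allowed and simply has to match the same text each time. And because glob_wildcards only matches files that actually exist on disk, the lists it returns are already valid (sample, ID) pairs; combining them with expand(..., zip, ...) avoids the invalid combinations. An untested sketch (the target name results/{sample}.id_{ID}.done is a hypothetical placeholder):

SAMPLES, IDS = glob_wildcards(config["path"] + "{sample}/map/filtered.{sample}.R1.clean.fq.gz.split/filtered.{sample}.R1.clean.id_{ID}.fq.gz")

rule all:
    input:
        # zip pairs the two lists element-wise instead of taking a cross product
        expand("results/{sample}.id_{ID}.done", zip, sample=SAMPLES, ID=IDS)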

Related

How to sort .csv files in R

I have one .csv file which I have imported into R. It contains a column with locations; some locations are repeated, depending on how many times that location has been surveyed. I have another column with the total no. of plastic items.
I would like to add together the number of plastic items for locations that appear more than once, and create a separate column with the total no. of plastic items and another column with the no. of times the location appeared.
I am unsure how to do this; any help will be much appreciated.
Using dplyr:
data %>%
  group_by(location) %>%
  mutate(TOTlocation = n(), TOTitems = sum(items))
And here's a base solution that does pretty much the same thing:
data[c("TOTloc","TOTitem")]<-t(sapply(data$location, function(x)
c(TOTloc=sum(data$location==x),
TOTitem=sum(data$items[data$location==x]))))
Note that in neither case do you need to sort anything - in dplyr you can use group_by to have each action done only on the part of the data set that belongs to a group determined by the contents of a certain column. In my base solution, I break down the locations list using sapply and then recalculate TOTloc and TOTitem for each row. This may not be a very efficient solution. A better solution would probably use split, but for some reason I couldn't make it work with my made-up dataset, so maybe someone else can suggest how best to do that.
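For what it's worth, here is an untested sketch of that split-based idea: compute each group's statistics once, then map them back to the rows by name (column names as in the question; the example data frame is made up):

data <- data.frame(location = c("A", "B", "A", "C", "B"),
                   items    = c(3, 1, 2, 5, 4))
loc <- as.character(data$location)        # index by name, not by factor code
by_loc <- split(data$items, loc)          # items grouped by location
data$TOTloc  <- lengths(by_loc)[loc]      # how often each location appears
data$TOTitem <- sapply(by_loc, sum)[loc]  # total items per location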

Using semi_join to find similarities, but it mistakenly returns none

I am trying to find the genes shared between two columns, so that I can later work with just those genes. Below is my code:
top100_1Beta <- data.frame(grp1_Beta$my_data.SYMBOL[1:100])
top100_2Beta <- data.frame(grp2_Beta$my_data.SYMBOL[1:100])
common100_Beta <- semi_join(top100_1Beta, top100_2Beta)
When I run the code I get the following error:
Error: by required, because the data sources have no common variables
This is wrong, since when I open top100_1Beta and top100_2Beta I can see that at least the first few rows list exactly the same genes: ATP2A1, SLMAP, MEOX2,...
I am confused about why it reports no commonalities.
Any help would be greatly appreciated.
Thanks!
I don't think you need any form of *_join here; instead, it seems you're looking for intersect:
intersect(grp1_Beta$my_data.SYMBOL[1:100], grp2_Beta$my_data.SYMBOL[1:100])
This returns a vector of the common entries amongst the first 100 entries of grp1_Beta$my_data.SYMBOL and grp2_Beta$my_data.SYMBOL.
Without a full working example, I'm guessing that your top100_1Beta and top100_2Beta dataframes do not have the same column names. They are probably grp1_Beta.my_data.SYMBOL.1.100. and grp2_Beta.my_data.SYMBOL.1.100.. This means the semi_join function doesn't know where to match the dataframes up. Renaming the columns should fix the issue.
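A sketch of that fix, naming the column explicitly when building each data frame (untested; assumes dplyr is loaded and grp1_Beta/grp2_Beta as in the question):

library(dplyr)
top100_1Beta <- data.frame(SYMBOL = grp1_Beta$my_data.SYMBOL[1:100])
top100_2Beta <- data.frame(SYMBOL = grp2_Beta$my_data.SYMBOL[1:100])
# Both frames now share the SYMBOL column, so semi_join can match on it
common100_Beta <- semi_join(top100_1Beta, top100_2Beta, by = "SYMBOL")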

Google Spreadsheet IF and AND

I'm trying to find an easy formula to do the following:
=IF(AND(H6="OK";H7="OK";H8="OK";H9="OK";H10="OK";H11="OK");"OK";"X")
This actually works. But I want to apply it to a range of cells within a column (H6:H11) instead of having to create a rule for each and every one of them. Trying it as a range:
=IF(AND(H6:H11="OK");"OK";"X")
does not work.
Any insights?
Thanks.
=ArrayFormula(IF(AND(H6:H11="OK");"OK";"X"))
also works
Array formulas work the same way they do in Excel; they just need an ArrayFormula() wrapped around them to work (it will be added automatically when pressing Ctrl+Alt+Return, like in Excel).
In google sheets the formula is:
=ArrayFormula(IF(SUM(IF(H6:H11="OK";1;0))=6;"OK";"X"))
in excel:
=IF(SUM(IF(H6:H11="OK";1;0))=6;"OK";"X")
And confirm with Ctrl-Shift-Enter
This basically counts the number of cells in the range that equal the criterion and compares it to the number it should be. So if the range is enlarged, increase the 6 to match.
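A possibly simpler alternative (my suggestion, not from the answers above) is COUNTIF, which takes a range directly and needs no array entry:
=IF(COUNTIF(H6:H11;"OK")=6;"OK";"X")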

How to randomly pick a number of combinations from all the combinations efficiently?

I know the function combn can generate all the possible combinations. However, if the total number of members is large, this is really time- and memory-consuming.
My goal is to randomly pick combinations from all the possible combinations. For example, I want 5000 distinct triples of members from a pool of 3000 members. I think I don't need to generate all possible combinations and choose 5000 from them, but it seems that R doesn't have a ready-to-use function for this. How should I deal with this problem?
This is not exactly what you need but perhaps it can get you started:
library(data.table) # to make the table easier
members <- 1:3000
X <- data.table(RUN = 1:5000)
X <- X[, as.list(sample(members, 3)), by = RUN]
This will create 3 new columns that are randomly selected from the members vector. See them as IDs of each member.
I would check how many of them are duplicated using:
X[duplicated(X, by = c('V1', 'V2', 'V3'))]
Is this helping you at all?
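Building on that, here is an untested base-R sketch of my own: sort each sampled triple so that permutations of the same three members count as the same combination (which the column-wise duplicated() check above would miss), and keep drawing until 5000 distinct triples are collected. With 3000 members the collision rate is tiny, so the loop rarely runs more than once.

set.seed(42)   # for reproducibility
members <- 1:3000
n_wanted <- 5000
combos <- character(0)
while (length(combos) < n_wanted) {
  # sorting makes {1,2,3} and {3,2,1} compare equal
  batch <- replicate(n_wanted, paste(sort(sample(members, 3)), collapse = "-"))
  combos <- unique(c(combos, batch))
}
combos <- combos[1:n_wanted]
result <- t(sapply(strsplit(combos, "-"), as.integer))   # 5000 x 3 matrix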

How to get R to use a certain dataset for multiple commands without using attach() or appending data="" to every command

So I'm trying to manipulate a simple Qualtrics CSV, and I want to use colSums on certain columns of data, given a certain filter.
For example: within the .csv file called data, I want to get the sum of a few columns, and print them with certain labels (say choice1, choice2 etc). That is easy enough by itself:
firstqn<-data.frame(choice1=data$Q7_2,choice2=data$Q7_3,choice3=data$Q7_4);
secondqn<-data.frame(choice1=data$Q8_6,choice2=data$Q8_7,choice3=data$Q8_8)
print(colSums(firstqn)); print(colSums(secondqn))
The problem comes when I want to repeat the above steps with different filters - say, only the rows where gender==2.
The only way I know how is to create a new dataset data2 and replace data$ with data2$ in every line of the above code, such as:
data2<-(data[data$Q2==2,])
firstqn<-data.frame(choice1=data2$Q7_2,choice2=data2$Q7_3,choice3=data2$Q7_4);
However, I have 6 choices for each of 5 questions and am planning to apply about 5-10 different filters, and I don't relish the thought of copy/pasting data2, data3, etc. hundreds of times.
So my question is: Is there any way of getting R to reference data by default without using data$ in front of every variable name?
I can probably use attach() to achieve this, but I really don't want to:
data2<-(data[data$Q2==2,])
attach(data2)
firstqn<-data.frame(choice1=Q7_2,choice2=Q7_3,choice3=Q7_4);
detach(data2)
Is there a command like attach() that would allow me to avoid using data$ in front of every variable for a specified stretch of code? Then whenever I wanted to create a new filter, I could just copy/paste the same code and change the first command (defining a new dataset).
I guess I'm looking for some command like with(data2, *insert multiple commands here*)
Alternatively, if anyone has a better way to do the above in an entirely different way, please enlighten me - I'm not very proficient at R (yet).
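with() does exactly this: it evaluates a braced block with the data set's columns in scope and returns the block's last value. An untested sketch using the question's own column names:

data2 <- data[data$Q2 == 2, ]   # filter once
sums <- with(data2, {
  firstqn  <- data.frame(choice1 = Q7_2, choice2 = Q7_3, choice3 = Q7_4)
  secondqn <- data.frame(choice1 = Q8_6, choice2 = Q8_7, choice3 = Q8_8)
  list(first = colSums(firstqn), second = colSums(secondqn))
})
print(sums$first); print(sums$second)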
